ABSTRACT 


In this research, feature extraction and classification algorithms for high 
dimensional data are investigated. Developments with regard to sensors for 
Earth observation are moving in the direction of providing much higher 
dimensional multispectral imagery than is now possible. 

In analyzing such high dimensional data, processing time becomes an 
important factor. With large increases in dimensionality and the number of 
classes, processing time will increase significantly. To address this problem, a 
multistage classification scheme is proposed which reduces the processing 
time substantially by eliminating unlikely classes from further consideration at 
each stage. Several truncation criteria are developed and the relationship 
between thresholds and the error caused by the truncation is investigated. 

Next, a novel approach to feature extraction for classification is proposed 
based directly on the decision boundaries. It is shown that all the features 
needed for classification can be extracted from decision boundaries. A novel 
characteristic of the proposed method arises by noting that only a portion of the 
decision boundary is effective in discriminating between classes, and the 
concept of the effective decision boundary is introduced. The proposed feature 
extraction algorithm has several desirable properties: (1) it predicts the 
minimum number of features necessary to achieve the same classification 
accuracy as in the original space for a given pattern recognition problem; and 
(2) it finds the necessary feature vectors. The proposed algorithm does not 
deteriorate under the circumstances of equal means or equal covariances as 
some previous algorithms do. In addition, the decision boundary feature 
extraction algorithm can be used for both parametric and non-parametric 
classifiers. 

Finally, we study some problems encountered in analyzing high 
dimensional data and propose possible solutions. We first recognize the 
increased importance of the second order statistics in analyzing high 
dimensional data. By investigating the characteristics of high dimensional data, 
we suggest the reason why the second order statistics must be taken into 
account in high dimensional data. Because the second order statistics are 
important, a way to represent them is needed. We propose a method to 
visualize statistics using a color code. By representing statistics using color 
coding, one can easily extract and compare the first and second order 
statistics. 




CHAPTER 1 INTRODUCTION 


1.1 Background 

Advances in sensor technology for Earth observation make it possible to 
collect multispectral data with much higher dimensionality. For example, the 
HIRIS instrument now under development for the Earth Observing System (EOS) 
will generate image data in 192 spectral bands simultaneously. In addition, 
multi-source data will also provide high dimensional data. Such high dimensional 
data will have several impacts on processing technology: (1) it will be possible 
to classify more classes; (2) more processing power will be needed to process 
such high dimensional data; and (3) feature extraction methods which utilize 
such high dimensional data will be needed. 

In this research, three main subjects are studied: fast likelihood 
classification, feature extraction, and the characteristics of high dimensional 
data and problems in analyzing high dimensional data. 

The analysis of remotely sensed data is usually done by machine-oriented 
pattern recognition techniques. One of the most widely used pattern 
recognition techniques is classification based on maximum likelihood (ML) 
assuming Gaussian distributions of classes. A problem of Gaussian ML 
classification is long processing time. This computational cost may become an 
important problem if the remotely sensed data of a large area is to be analyzed 
or if the processing hardware is more modest in its capabilities. The advent of 
future sensors will aggravate this problem. As a result, it will be an important 
problem to extract detailed information from high dimensional data while 
reducing processing time considerably. 


Feature extraction has long been an important topic in pattern 
recognition and has been studied by many authors. Linear feature extraction 
can be viewed as finding a set of vectors which effectively represent the 
information content of an observation while reducing the dimensionality. In 
pattern recognition, it is desirable to extract features which are focused on 
discriminating between classes. Although numerous feature extraction/selection 
algorithms have been proposed and successfully applied, it is also true that 
there are some circumstances where the previous methods do not work well. In 
particular, if there is little difference in mean vectors or little difference in 
covariance matrices, some of the previous feature extraction methods fail to find 
a good feature set. 

Although many feature extraction algorithms for parametric classifiers have 
been proposed, relatively few feature extraction algorithms are available for 
non-parametric classifiers. Furthermore, few feature extraction algorithms are 
available which utilize the characteristics of a given non-parametric classifier. 
As the use of non-parametric classifiers such as neural networks to solve 
complex problems increases, there is a great need for an effective feature 
extraction algorithm for non-parametric classifiers. 

In dealing with high dimensional data, there will be problems which have 
not been encountered in analyzing relatively low dimensional data. In order to 
realize the full potential of high dimensional data, it is necessary to understand 
the characteristics of high dimensional data. One of these characteristics is the 
increased importance of the second order statistics. Although some classifiers, 
such as a minimum distance classifier utilizing only first order statistics, often 
perform relatively well on low dimensional data, it is observed that classifiers 
utilizing only first order statistics show limited performance in high dimensional 
space. Further, information contained in the second order statistics plays an 
important role in discriminating between classes in high dimensional data. We 
will illustrate this problem and investigate the reasons for it by examining the 
characteristics of such high dimensional data. 

More detailed background and related works on each of these subjects 
will be discussed at the beginning of each chapter. 


1.2 Objective of Research 

It is the objective of this research to better understand the characteristics 
of high dimensional data relative to the analysis process, and to create 
algorithms which increase the feasibility of its use. 

In order to utilize the discriminating power of high dimensional data 
without increasing processing time significantly, a fast likelihood classification 
algorithm based on a multistage scheme is proposed. At each stage, unlikely 
classes are eliminated from further consideration, thus reducing the number of 
classes for which likelihood values are to be calculated at the next stage. 
Several truncation criteria are developed and the relationship between such 
truncation and the resulting increase in error is investigated. 

Another objective of this research is to develop a feature extraction 
algorithm which better utilizes the potential of high dimensional data. The 
proposed feature extraction algorithm is based directly on the decision 
boundary. By directly extracting feature vectors from the decision boundary 
without assuming any underlying density function, the proposed algorithm can 
be used for both parametric and non-parametric classifiers. The proposed 
algorithm also predicts the minimum number of features needed to achieve the 
same classification accuracy as in the original space for a given problem and 
finds all the needed feature vectors. In addition, the proposed algorithm does 
not deteriorate under the circumstances of equal means or equal covariances 
as some previous algorithms do. 

It is a further objective of this research to investigate and understand the 
characteristics of high dimensional data. Problems in applying to high 
dimensional data some analysis techniques which were primarily developed for 
relatively low dimensional data are studied. In particular, the increased role of 
second order statistics in analyzing high dimensional data is examined. 
Although most analysis and classification of data are conducted by machine, 
it is sometimes helpful and necessary for humans to interpret and analyze data. 
However, as the dimensionality grows, it becomes increasingly difficult for 
humans to extract information from numerical values. In order to overcome 
this problem, a visualization method is proposed using a color coding scheme. 
In this method, the correlation matrix of a class is displayed using a color code 
along with the mean vector and the standard deviation. Each color represents a 
degree of correlation. 


1.3 Research Organization 

In Chapter 2, the fast likelihood classification algorithm is presented for 
high dimensional data. A method to avoid redundant calculations in multi-stage 
classification is proposed. Several truncation criteria are developed and the 
relationship between truncation and truncation error is investigated. 
Experimental results are presented and compared. In Chapter 3, after reviewing 
various feature extraction algorithms, the decision boundary feature extraction 
algorithm is developed. After several new concepts are defined, all the needed 
equations are derived. A decision boundary feature extraction procedure for 
parametric classifiers is proposed and experimental results are presented. In 
Chapter 4, the decision boundary feature extraction algorithm is extended to 
non-parametric classifiers. In Chapter 5, the decision boundary feature 
extraction algorithm is applied to a neural network. In Chapter 6, discriminant 
feature extraction, which is a generalization of the decision boundary feature 
extraction, is presented. In Chapter 7, problems encountered in analyzing high 
dimensional data are studied and the characteristics of high dimensional data 
are investigated. In Chapter 8, conclusions are summarized and suggestions for 
future work are presented. Proofs of theorems, color pictures, and programs are 
presented in appendices. 


CHAPTER 2 FAST LIKELIHOOD CLASSIFICATION 


2.1 Introduction 

Earth observing systems such as the LANDSAT MSS and Thematic 
Mapper have played a significant role in understanding and analyzing the Earth 
resources by providing remotely sensed data of the Earth surface on a regular 
basis. The analysis of remotely sensed data is usually done by machine-oriented 
pattern recognition techniques. One of the most widely used pattern 
recognition techniques is classification based on maximum likelihood (ML) 
assuming Gaussian distributions of classes. A problem of Gaussian ML 
classification is long processing time. This computational cost may become an 
important problem if the remotely sensed data of a large area is to be analyzed 
or if the processing hardware is more modest in its capabilities. The advent of 
future sensors, for example HIRIS (High Resolution Imaging Spectrometer) 
(Goetz and Herring 1989), which is projected to collect data in many more 
spectral bands, will aggravate this problem. 

In this chapter, we propose a multistage classification procedure which 
reduces the processing time substantially while maintaining essentially the 
same accuracy. The proposed multistage classification procedure is composed 
of several stages, and at each stage likelihood values of classes are calculated 
using a fraction of the total features. This fraction increases as stages proceed. 
Classes which are determined to be unlikely candidates by comparing 
likelihood values with a threshold are truncated, i.e., eliminated from further 
consideration so that the number of classes for which likelihood values are to 
be calculated at the following stages is reduced. Depending on the number of 
features and the number of classes, the processing time can be reduced by a 
factor of 3 to 7. 


2.2 Related Works and Background 

Processing time has been an influential factor in designing classifiers for 
the analysis of remotely sensed data. Even with the data from the previous 
sensors such as MSS and TM, for which the numbers of spectral bands are 4 
and 7, respectively, the cost of the analysis of even a moderate area was 
considerable. Future sensors such as HIRIS, which will collect data in 192 
spectral bands at 30 m spatial resolution, will aggravate this problem. 

Efforts to reduce processing time have been pursued in various ways. By 
employing feature selection/extraction algorithms [(Muasher and 
Landgrebe 1983), (Duchene and Leclercq 1988), (Riccia and Shapiro 1983), 
(Eppler 1976) and (Merembeck and Turner 1980)], the number of features can 
be reduced substantially without sacrificing significant information. Feature 
selection/extraction is generally done by removing redundant features or by 
finding new features in transformed coordinates. This reduction in the number of 
features has several advantages. First, higher accuracies can be achieved 
in cases where the number of training samples is low, due to the Hughes 
phenomenon (Hughes 1968). Second, since processing time generally increases 
with the square of the number of features, feature selection/extraction also 
reduces processing time. 

Another possible approach to reduce computing time can be found in 
decision tree classifiers [(Swain and Hauska 1977), (Chang and Pavlidis 1977), 
and (Wang and Suen 1987)]. Though the decision tree classifier can have 
several advantages depending on the situation, one of the advantages is 
processing time. For instance, in an ideal binary decision tree classifier, the 
computing time will be proportional to ln(M) instead of M where M is the number 
of classes, assuming the same number of features is used at each node. 
However, how to find the optimum tree structure still remains a problem for the 
decision tree classifier, though many algorithms have been proposed for the 
design of decision tree classifiers (Argentiero et al., 1982). 

Feiveson (1983) proposed a procedure to reduce computing time by 
employing thresholding. In his algorithm, the most likely candidate class of a 
given observation is selected based on some prediction, and its probability 
density function is calculated. If the probability density function is greater than a 
threshold, calculation of the probability density functions for the other classes is 
omitted, resulting in reduction of computing time. If it is possible to make a good 
prediction, the computing time can be reduced significantly. But a problem of 
this method is that its performance depends on the accuracy of predictions, 
especially when many classes are involved. 

Wald’s sequential probability ratio test (SPRT) provides another 
perspective (Wald, 1947). In Wald’s sequential probability ratio test, the 
sequential probability ratio 

$$\lambda_n = \frac{p_n(X|\omega_1)}{p_n(X|\omega_2)}$$

is computed, where $p_n(X|\omega_i)$ is the conditional probability density 
function of X for class $\omega_i$ and n denotes the number of features. Then the 
likelihood ratio, $\lambda_n$, is compared with two stopping boundaries A and B. If 

$\lambda_n > A$, then it is decided that $X \sim \omega_1$ 
$\lambda_n < B$, then it is decided that $X \sim \omega_2$ 

If $B < \lambda_n < A$, an additional feature will be taken and the likelihood 
ratio will be computed and compared again with the additional feature included. 
The error probabilities are related to the two stopping boundaries by the 
following expressions. 

$$A = \frac{1 - \varepsilon_{21}}{\varepsilon_{12}} \quad \text{and} \quad B = \frac{\varepsilon_{21}}{1 - \varepsilon_{12}}$$

where $\varepsilon_{ij}$ is the probability of deciding $X \sim \omega_i$ when $X \sim \omega_j$ is true. 
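
The test described above can be sketched in a few lines. This is a minimal illustration rather than the procedure used in this work: it assumes independent Gaussian features so that the log likelihood ratio is a running sum, and the names `sprt_boundaries`, `log_gaussian` and `sprt_classify` are hypothetical.

```python
import math

def sprt_boundaries(e12, e21):
    """Stopping boundaries A and B from the two error probabilities.
    e12: probability of deciding class 1 when class 2 is true.
    e21: probability of deciding class 2 when class 1 is true."""
    A = (1.0 - e21) / e12
    B = e21 / (1.0 - e12)
    return A, B

def log_gaussian(x, mean, std):
    """Log of the univariate normal density."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def sprt_classify(features, params1, params2, e12=0.01, e21=0.01):
    """Add one feature at a time until the likelihood ratio crosses A or B.
    params1, params2: per-feature (mean, std) pairs for class 1 and class 2.
    Features are assumed independent (an illustrative simplification).
    Returns (decision, features_used); decision is None if no boundary is hit."""
    A, B = sprt_boundaries(e12, e21)
    log_a, log_b = math.log(A), math.log(B)
    log_ratio = 0.0
    for n, x in enumerate(features, start=1):
        m1, s1 = params1[n - 1]
        m2, s2 = params2[n - 1]
        log_ratio += log_gaussian(x, m1, s1) - log_gaussian(x, m2, s2)
        if log_ratio > log_a:
            return 1, n
        if log_ratio < log_b:
            return 2, n
    return None, len(features)
```

Working with the logarithm of the ratio avoids numerical underflow when many features are accumulated; the comparison against ln A and ln B is equivalent to comparing the ratio itself against A and B.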


A sequential probability ratio test was applied to pattern classification by 
Fu [(Fu 1962) and (Chien and Fu 1966)]. When the cost of a feature measure is 
high or features are sequential in nature, the sequential classification proved 
useful. Although the sequential classifier achieves the desired accuracy with the 
minimum number of features, the processing time of the sequential classifier 
may not be reduced proportionally due to the repeated calculation of the 
probability density function. 
Generally it is true that the processing time of a classifier increases as the 
number of features increases. In the Gaussian ML classifier, for instance, the 
processing time is proportional to the square of the number of features. 
Therefore it is possible to reduce the processing time considerably by exploiting 
the property of the SPRT that the decision is reached with the lowest possible 
number of features, if the redundant computations caused by repeated 
calculation of likelihood values can be avoided. There is, however, another 
problem in the straightforward application of SPRT to pattern recognition where 
there are more than two classes. The general relationship between stopping 
boundaries and the optimum property of SPRT remains to be understood if there 
are more than two classes. 

The SPRT does not take into account the separability of two classes. If 
the separability of classes is taken into account, a decision may be reached 
sooner. Considering two cases of a two-class classification problem (Figure 
2.1), it is observed that errors in case I are smaller than errors in case II even 
though the same stopping boundaries are used for both cases. Thus for the 
same error tolerances, the stopping boundaries for case I can be less strict than 
those for case II. 

Figure 2.1 A hypothetical example where classes are more separable in 
case I than those in case II. 


In this chapter, an algorithm is proposed which avoids the redundant 
calculations of SPRT so that the characteristic of SPRT, which classifies with 
the lowest possible number of features, can be exploited in such a way that the 
decision can be made with considerably less processing time [(Lee and 
Landgrebe 1990), (Lee and Landgrebe 1991-3)]. It is noted that the proposed 
multistage classifier is different from Wald’s sequential classifier in that the 
number of stages of the multistage classifier is considerably smaller than the 
number of features, while the sequential classifier has essentially the same 
number of stages as the number of features. Also the criteria for truncation are 
different. We also address the case where there are more than two classes. 
Though some error is inevitably introduced by truncation, the error is minimal 
and can be constrained within any specified range. Most of the samples which 
cause error are found to be outliers. The relationship between truncation and 
error caused by the truncation is investigated. 


2.3 Multistage Gaussian Maximum Likelihood Classification 

In the conventional Gaussian ML classifier, a discriminant function is 
calculated for all classes using the whole feature set and the class which has 
the largest value is chosen as the classification result of an observation X. 

$$X \in \omega_i \quad \text{if } g_i(X) > g_j(X) \text{ for all } j \neq i$$

where the discriminant function $g_i$ is given by 

$$g_i(X) = -\ln|\Sigma_i| - (X - M_i)^t \Sigma_i^{-1} (X - M_i) \tag{2.1}$$

where $\Sigma_i$ is the covariance matrix of class $\omega_i$ and $M_i$ is the mean 
vector of class $\omega_i$. The discriminant function is essentially the log 
likelihood value. 
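
Equation (2.1) can be evaluated without forming the inverse covariance matrix explicitly. The sketch below is one possible way to do so, using a Cholesky factor for both the log determinant and the quadratic form; it is an illustration under the assumption of symmetric positive definite covariance matrices, and the function names are hypothetical.

```python
import math

def cholesky(cov):
    """Lower-triangular L with L L^t = cov (cov symmetric positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def discriminant(x, mean, cov):
    """g(X) = -ln|cov| - (X - M)^t cov^{-1} (X - M), as in equation (2.1)."""
    L = cholesky(cov)
    d = [xi - mi for xi, mi in zip(x, mean)]
    # Forward substitution solves L y = d; the quadratic form is then y . y
    y = []
    for i in range(len(d)):
        y.append((d[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(d)))
    return -log_det - sum(v * v for v in y)

def classify(x, means, covs):
    """Index of the class with the largest discriminant value."""
    scores = [discriminant(x, m, c) for m, c in zip(means, covs)]
    return max(range(len(scores)), key=scores.__getitem__)
```

In a practical classifier the Cholesky factor of each class would be computed once during training and reused for every observation.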


In the proposed multistage classifier, at each intermediate stage only a 
portion of the features is used to calculate the discriminant function, and the 
classes whose discriminant function values are less than a threshold are 
truncated. At the final stage the whole feature set is used. The block diagram of 
the multistage classifier is shown in Figure 2.2. 

Figure 2.2 Multistage classifier. 

By truncating unlikely classes at early stages where only a small portion 
of the whole feature set is used, it is possible to reduce the number of classes 
for the later stages where more features are to be used. Therefore if it is 
possible to truncate a substantial number of classes at early stages, the 
processing time can be reduced substantially. However, there are several 
problems to be addressed. Since the discriminant function must be calculated 
repeatedly at each stage, additional calculations are inevitably introduced. Thus 
truncation alone does not guarantee less processing time. Another problem is 
to develop criteria for truncation. The successful application of the multistage 
classifier in reducing processing time depends on how accurately and early a 
class can be truncated with little risk of introducing truncation error. 


2.3.1 Additional Calculations in a Multistage Classifier 

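Suppose an N-stage classifier where n features are used at the nth stage 
and N is the total number of features. A possibility to avoid unnecessary 
calculations is to use the discriminant function values from the current stage in 
calculating the discriminant function values for the next stage. The most time 
consuming part in calculating the discriminant function (equation 2.1) is matrix 
multiplication. Therefore to avoid the additional calculation, one would like to be 
able to use 

$$(X_n - M_n)^t \Sigma_n^{-1} (X_n - M_n)$$

in calculating 

$$(X_{n+1} - M_{n+1})^t \Sigma_{n+1}^{-1} (X_{n+1} - M_{n+1})$$

where the subscript denotes the number of features, and X, M are as in 
equation (2.1). 

This cannot be done easily since 

$$\Sigma_{n+1}^{-1} \neq \begin{bmatrix} \Sigma_n^{-1} & p \\ p^t & a \end{bmatrix}$$

where $\Sigma_n$ is the covariance matrix of n features, p is a column vector, a is 
a scalar, and $\Sigma_{n+1}$ is the covariance matrix of (n+1) features, even though 

$$\Sigma_{n+1} = \begin{bmatrix} \Sigma_n & u \\ u^t & \sigma \end{bmatrix}$$

where u is a column vector and $\sigma$ is a scalar. But it can be shown that if 
$\Sigma_n$ is invertible and $(\sigma - u^t \Sigma_n^{-1} u)$ is not zero, then 
$\Sigma_{n+1}$ is also invertible and 

$$\Sigma_{n+1}^{-1} = \begin{bmatrix} \Sigma_n^{-1} + \alpha \Sigma_n^{-1} u u^t \Sigma_n^{-1} & -\alpha \Sigma_n^{-1} u \\ -\alpha u^t \Sigma_n^{-1} & \alpha \end{bmatrix} \quad \text{where } \alpha = \frac{1}{\sigma - u^t \Sigma_n^{-1} u}$$

Without loss of generality, we can assume that the mean vector is zero. Then, 
writing $x_{n+1}$ for the (n+1)th feature, 

$$(X_{n+1} - M_{n+1})^t \Sigma_{n+1}^{-1} (X_{n+1} - M_{n+1})$$
$$= \begin{bmatrix} X_n^t & x_{n+1} \end{bmatrix} \begin{bmatrix} \Sigma_n^{-1} + \alpha \Sigma_n^{-1} u u^t \Sigma_n^{-1} & -\alpha \Sigma_n^{-1} u \\ -\alpha u^t \Sigma_n^{-1} & \alpha \end{bmatrix} \begin{bmatrix} X_n \\ x_{n+1} \end{bmatrix}$$
$$= X_n^t (\Sigma_n^{-1} + \alpha \Sigma_n^{-1} u u^t \Sigma_n^{-1}) X_n - 2\alpha x_{n+1} u^t \Sigma_n^{-1} X_n + \alpha x_{n+1}^2$$
$$= X_n^t \Sigma_n^{-1} X_n + \alpha (u^t \Sigma_n^{-1} X_n)(u^t \Sigma_n^{-1} X_n) - 2\alpha x_{n+1} u^t \Sigma_n^{-1} X_n + \alpha x_{n+1}^2$$
$$= X_n^t \Sigma_n^{-1} X_n + \alpha \left[ (u^t \Sigma_n^{-1} X_n) \{ (u^t \Sigma_n^{-1} X_n) - 2 x_{n+1} \} + x_{n+1}^2 \right] \tag{2.2}$$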

Considering equation (2.2), the value of $X_n^t \Sigma_n^{-1} X_n$ is known from 
the current stage and $u^t \Sigma_n^{-1}$ can be calculated once at the beginning 
and saved. Therefore the numbers of multiplications and additions required to 
calculate $(X_{n+1} - M_{n+1})^t \Sigma_{n+1}^{-1} (X_{n+1} - M_{n+1})$ are (n+3) and 
(n+4), respectively. Thus the total numbers of multiplications and additions for a 
class which passes all the truncation tests and reaches the final stage are given 
by equation (2.3.1) and equation (2.3.2), respectively. 

$$\sum_{n=1}^{N-1} (n+3) = \frac{1}{2}N^2 + \frac{5}{2}N - 3 \tag{2.3.1}$$

$$\sum_{n=1}^{N-1} (n+4) = \frac{1}{2}N^2 + \frac{7}{2}N - 4 \tag{2.3.2}$$


On the other hand, the total number of multiplications and additions of the 
conventional single stage Gaussian ML classifier are given by equation (2.4.1) 
and equation (2.4.2), respectively 


M + n = 1 n2+ |n.|n 2 

(2.4.1) 


(2.4.2) 


Comparing equations (2.3.1-2) and equations (2.4.1-2), it can be seen 
that the total numbers of multiplications and additions of both methods are of 
about the same order and the multistage classifier does not introduce significant 
additional calculations. 
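
The recursion of equation (2.2) can be checked numerically. The following sketch grows the inverse covariance matrix one feature at a time with the block inverse formula and updates the quadratic form with equation (2.2). It is an illustration only: the helper names are hypothetical, and in a real classifier the vector $\Sigma_n^{-1} u$ and the scalar $\alpha$ would be precomputed once per class and per stage rather than inside the loop.

```python
def grow_inverse(inv_n, u, sigma):
    """Inverse of [[S_n, u], [u^t, sigma]] given inv_n = S_n^{-1}.
    Returns (inverse, w, alpha) with w = S_n^{-1} u and
    alpha = 1 / (sigma - u^t S_n^{-1} u)."""
    n = len(u)
    w = [sum(inv_n[i][k] * u[k] for k in range(n)) for i in range(n)]
    alpha = 1.0 / (sigma - sum(u[i] * w[i] for i in range(n)))
    inv = [[inv_n[i][j] + alpha * w[i] * w[j] for j in range(n)]
           + [-alpha * w[i]] for i in range(n)]
    inv.append([-alpha * w[j] for j in range(n)] + [alpha])
    return inv, w, alpha

def quad_form(inv, d):
    """Direct evaluation of d^t inv d, used here only as a reference."""
    n = len(d)
    return sum(d[i] * inv[i][j] * d[j] for i in range(n) for j in range(n))

def update_quad(q_n, d_n, d_next, w, alpha):
    """Equation (2.2): q_{n+1} = q_n + alpha [c (c - 2 x_{n+1}) + x_{n+1}^2],
    where c = u^t S_n^{-1} X_n and d holds the mean-removed features."""
    c = sum(w[i] * d_n[i] for i in range(len(d_n)))
    return q_n + alpha * (c * (c - 2.0 * d_next) + d_next * d_next)

def staged_quadratic_forms(cov, d):
    """Return [q_1, ..., q_N], where q_n uses the first n features of d = X - M."""
    inv = [[1.0 / cov[0][0]]]
    q = d[0] * d[0] / cov[0][0]
    out = [q]
    for n in range(1, len(d)):
        u = [cov[i][n] for i in range(n)]
        inv, w, alpha = grow_inverse(inv, u, cov[n][n])
        q = update_quad(q, d[:n], d[n], w, alpha)
        out.append(q)
    return out
```

Each call to `update_quad` costs one inner product of length n plus a handful of scalar operations, which is the source of the linear per-stage cost discussed above.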


2.3.2 Truncation by Absolute Region 

Since unlikely classes are to be truncated at each stage in the multistage 
classifier, a criterion for truncation must be developed. Along with the criteria for 
truncation, the relationship between truncation and error caused by truncation 
must also be understood and quantified. 

One possible way for truncation is to find the smallest region, $\Omega_i$, for 
class $\omega_i$ which contains a certain portion, $P_t$, of class $\omega_i$, and to 
check whether a test sample is within that region. If the data classes are 
assumed to have Gaussian distributions, the smallest region for a class will be a 
hyperellipsoid which is centered at the mean of the class and whose semi-axes 
are in the directions of the eigenvectors of the covariance of the class with 
lengths proportional to the square roots of the corresponding eigenvalues. If a 
test sample is found outside region $\Omega_i$, class $\omega_i$ can be truncated with 
the risk of error $1 - P_t$. For example in Figure 2.3, class 1 can be truncated as 
an unlikely class with risk of error 0.001. 



Figure 2.3 A hypothetical distribution of 3 classes. 

Finding the smallest region $\Omega_i$ for class $\omega_i$ is equivalent to finding 
$r_0$ such that 

$$\Pr\{X \mid (X - M_i)^t \Sigma_i^{-1} (X - M_i) \le r_0^2\} = P_t$$

where $M_i$ and $\Sigma_i$ are the mean vector and the covariance matrix of 
class $\omega_i$. 

The smallest region $\Omega_i$ is given by 

$$\Omega_i = \{X \mid (X - M_i)^t \Sigma_i^{-1} (X - M_i) \le r_0^2\} \tag{2.5}$$

The quantity $(X - M_i)^t \Sigma_i^{-1} (X - M_i)$ is the so-called Mahalanobis 
distance. It is noteworthy that, for a given $P_t$, $r_0$ does not depend on M and 
$\Sigma$ but depends solely on n, the dimensionality. 

$$\Pr\nolimits_n\{X \mid (X - M)^t \Sigma^{-1} (X - M) \le r_0^2\} = \int_{(X-M)^t \Sigma^{-1} (X-M) \le r_0^2} \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\{-\tfrac{1}{2}(X - M)^t \Sigma^{-1} (X - M)\}\, dX$$

where the subscript n in $\Pr_n$ denotes the number of elements in X. 

The quantity $r = \sqrt{(X - M)^t \Sigma^{-1} (X - M)}$ is a chi statistic, and 
$\Pr_n\{X \mid (X - M)^t \Sigma^{-1} (X - M) \le r_0^2\}$ is given by 

$$\Pr\nolimits_n\{X \mid (X - M)^t \Sigma^{-1} (X - M) \le r_0^2\} = \Pr\nolimits_n\{r \mid r^2 \le r_0^2\} = \frac{C_n}{(2\pi)^{n/2}} \int_0^{r_0} r^{n-1} e^{-r^2/2}\, dr = P_t \tag{2.6}$$

where $C_n = \dfrac{2\pi^{n/2}}{\Gamma(n/2)}$ and $\Gamma(\cdot)$ is a gamma function. 

Therefore for a given threshold probability $P_t$, one can find $r_0$ by solving 
equation (2.6) and the region for class $\omega_i$ is given by equation (2.5). An 
advantage of the above method of truncation is that the truncation can be 
performed on an absolute basis. In other words, the truncation can be 
performed by calculating the likelihood value of a class, and no information 
about the other classes is required. It is noted that checking truncation by the 
above method does not require any additional computation. It can be performed 
as a part of calculating the discriminant function (equation 2.1). Figure 2.4 
shows the flowchart of the multistage classifier where class $\omega_i$ is truncated 
if a test sample is found outside region $\Omega_i$ containing a prescribed portion 
of class $\omega_i$. 

Figure 2.4 Flowchart of the multistage classifier. 
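
Solving equation (2.6) for $r_0$ has no closed form in general, but it is easy to do numerically. The sketch below is a minimal illustration using only Simpson's rule and bisection; the function names and the numerical tolerances are assumptions made for the example, and a statistics library's chi-square quantile function could be used instead.

```python
import math

def chi_probability(r0, n, steps=2000):
    """Pr{r <= r0} for the chi statistic with n features (the integral in
    equation 2.6), evaluated with Simpson's rule; steps must be even."""
    if r0 <= 0.0:
        return 0.0
    # C_n / (2*pi)^(n/2) simplifies to 1 / (2^(n/2 - 1) * Gamma(n/2)).
    scale = 1.0 / (2.0 ** (n / 2.0 - 1.0) * math.gamma(n / 2.0))

    def density(r):
        return scale * r ** (n - 1) * math.exp(-0.5 * r * r)

    h = r0 / steps
    total = density(0.0) + density(r0)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * density(k * h)
    return total * h / 3.0

def solve_r0(p_t, n, tol=1e-8):
    """Radius r0 such that a fraction p_t of a Gaussian class lies inside
    the hyperellipsoid of equation (2.5); found by bisection."""
    if not 0.0 < p_t < 1.0:
        raise ValueError("p_t must be strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while chi_probability(hi, n) < p_t:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi_probability(mid, n) < p_t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def is_truncated(mahalanobis_sq, r0):
    """A class is truncated when the sample falls outside its region."""
    return mahalanobis_sq > r0 * r0
```

Because $r_0$ depends only on $P_t$ and the number of features, it can be tabulated once per stage and shared by all classes.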


2.3.3 Truncation by Likelihood Ratio 

Another possibility for a truncation criterion is by likelihood ratio, or, 
equivalently, by difference of the log likelihood values. Since in likelihood 
classifiers classification is based upon the relative size of the likelihood values, 
classes which have relatively low likelihood values compared with class 
$\omega_l$, the class having the largest likelihood value, when only a fraction of 
the whole feature set is used, would be expected to have lower likelihood values 
relative to class $\omega_l$ when the whole feature set is used. Thus such classes 
could be truncated at an intermediate stage with little risk of error. To be more 
precise, at each stage of the multistage classifier the discriminant function 
(equation 2.1), which is equivalent to the log likelihood value of each class, is 
computed. If $g_i(X) < T$, then class $\omega_i$ is truncated, where T is a 
threshold determined by 

$$T = L - D \quad \text{where } L = \max\{g_i(X),\ i = 1, \ldots, m\}$$

where m is the number of classes and D is a difference to be 
selected by the user. 
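
The truncation rule and the stage loop can be sketched as follows. This is a simplified illustration, not the implementation evaluated in this chapter: the discriminant is passed in as a callable, a single difference D is used for all stages, and the toy discriminant assumes identity covariance matrices so that equation (2.1) reduces to a negative squared distance. All names are hypothetical.

```python
def truncate_by_difference(scores, D):
    """Keep the classes whose discriminant value is at least T = L - D,
    where L is the largest discriminant value at this stage."""
    L = max(scores.values())
    T = L - D
    return [c for c, g in scores.items() if g >= T]

def multistage_classify(x, class_ids, stage_sizes, discriminant, D):
    """stage_sizes: increasing feature counts; the last entry is the final stage.
    discriminant(c, x_prefix) returns the log likelihood value of class c
    computed from the leading features in x_prefix."""
    alive = list(class_ids)
    for n in stage_sizes[:-1]:
        scores = {c: discriminant(c, x[:n]) for c in alive}
        alive = truncate_by_difference(scores, D)
        if len(alive) == 1:
            return alive[0]  # only one candidate left, no need to continue
    n_final = stage_sizes[-1]
    final = {c: discriminant(c, x[:n_final]) for c in alive}
    return max(final, key=final.get)

def make_unit_covariance_discriminant(means):
    """Toy discriminant assuming identity covariance (illustration only)."""
    def g(c, x_prefix):
        return -sum((xi - mi) ** 2 for xi, mi in zip(x_prefix, means[c]))
    return g
```

For instance, with three classes and stages of 2 and 4 features, a class whose mean is far from the sample in the first two features is dropped after the first stage and never evaluated with the full feature set.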


In this case, it is important to understand the relationship between the threshold 
and the error increment caused by the truncations. We next derive an upper 
bound on the error increment caused by the truncation. 


In a two class classification problem using a Bayes' decision rule with the 
[0,1] loss function case, the decision is made by the following rule (Fukunaga, 
1990). 

$$p(X|\omega_1)P(\omega_1) > p(X|\omega_2)P(\omega_2) \;\Rightarrow\; X \sim \omega_1$$
$$p(X|\omega_1)P(\omega_1) < p(X|\omega_2)P(\omega_2) \;\Rightarrow\; X \sim \omega_2$$

The error probability is given by 

$$\varepsilon = P(\omega_1)\varepsilon_1 + P(\omega_2)\varepsilon_2$$

where 

$$\varepsilon_1 = \int_t^{\infty} p(h|\omega_1)\,dh \quad \text{and} \quad \varepsilon_2 = \int_{-\infty}^{t} p(h|\omega_2)\,dh$$

$$h(X) = \ln\frac{p(X|\omega_2)}{p(X|\omega_1)}, \qquad t = \ln\frac{P(\omega_1)}{P(\omega_2)}$$

The quantities $\varepsilon_1$ and $\varepsilon_2$ are bounded, for $0 \le s \le 1$, by 

$$\varepsilon_1 \le \exp[-\mu(s) - st] \int_t^{\infty} p_g(g = h|\omega_1)\,dh \le \exp[-\mu(s) - st]$$

$$\varepsilon_2 \le \exp[-\mu(s) + (1-s)t] \int_{-\infty}^{t} p_g(g = h|\omega_1)\,dh \le \exp[-\mu(s) + (1-s)t]$$

where 

$$p_g(g = h|\omega_1) = \exp[sh + \mu(s)]\,p(h|\omega_1)$$

$$\mu(s) = -\ln\varphi_1(s) = -\ln\int_{-\infty}^{+\infty} \exp(sh)\,p(h|\omega_1)\,dh$$

$\mu(s)$ is obtained by taking the minus logarithm of $\varphi_1(s)$, which is the 
moment generating function of h(X) for $\omega_1$, and $p_g(g = h|\omega_1)$ is a 
probability density function. In the case of normal distributions, an explicit 
mathematical expression for $\mu(s)$ can be obtained. 

$$\mu(s) = \frac{s(1-s)}{2}(M_2 - M_1)^t [s\Sigma_1 + (1-s)\Sigma_2]^{-1} (M_2 - M_1) + \frac{1}{2}\ln\frac{|s\Sigma_1 + (1-s)\Sigma_2|}{|\Sigma_1|^s |\Sigma_2|^{1-s}}$$

The term $\mu(\tfrac{1}{2})$ is called the Bhattacharyya distance (Fukunaga, 1990) 
and is used as a measure of the separability of two classes. The Bhattacharyya 
distance gives an upper bound for the Bayes' error in the case of normal 
distributions. By moving the decision boundary, one can reduce the omission 
error arbitrarily for a specific class even though the overall error may increase. 
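
To give a feeling for how loose or tight these bounds are, the normal-case expression can be evaluated for a single feature, where the matrices reduce to variances. The sketch below is an illustrative scalar special case, not the general multivariate computation; the function names are hypothetical.

```python
import math

def chernoff_mu(s, m1, var1, m2, var2):
    """mu(s) for two univariate normal classes (scalar form of the
    normal-case expression, with covariance matrices replaced by variances)."""
    v = s * var1 + (1.0 - s) * var2
    separation = 0.5 * s * (1.0 - s) * (m2 - m1) ** 2 / v
    shape = 0.5 * math.log(v / (var1 ** s * var2 ** (1.0 - s)))
    return separation + shape

def bhattacharyya(m1, var1, m2, var2):
    """Bhattacharyya distance, i.e. mu(1/2)."""
    return chernoff_mu(0.5, m1, var1, m2, var2)

def error_bounds(s, t, m1, var1, m2, var2):
    """Upper bounds exp[-mu(s) - s t] and exp[-mu(s) + (1 - s) t]
    on the two class-conditional errors, valid for 0 <= s <= 1."""
    mu = chernoff_mu(s, m1, var1, m2, var2)
    return math.exp(-mu - s * t), math.exp(-mu + (1.0 - s) * t)
```

With equal priors (t = 0), unit variances and means 0 and 2, the bound at s = 1/2 is exp(-0.5), roughly 0.61, while the exact class-conditional error is about 0.16, so the bound is valid but conservative. The second term of `chernoff_mu` stays positive when the means coincide and only the variances differ.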

In a similar way, an upper bound on the incremental error of the multistage 
classification with likelihood value truncation can be obtained. Assume class 
$\omega_l$ has the largest log likelihood value L at the nth stage. Class $\omega_i$ 
is truncated if 

$$\ln p_n(X|\omega_i) + \ln\frac{P(\omega_i)}{P(\omega_l)} + D_{il}^n < \ln p_n(X|\omega_l).$$

The truncation error of class $\omega_i$ is bounded by 

$$\varepsilon_i^n \le \exp[-\mu_{il}^n(s) - s\,t_{il}^n] \quad \text{where } t_{il}^n = \ln\frac{P(\omega_i)}{P(\omega_l)} + D_{il}^n \tag{2.9}$$

where $\mu_{il}^n(s)$ is the J-M distance of class $\omega_i$ and class $\omega_l$, 
$D_{il}^n$ is an offset value of class $\omega_i$ and class $\omega_l$, and the 
superscript n denotes the number of features. 

It is noteworthy that the truncation boundary is moved so that truncation 
error is reduced. Therefore by adjusting $D_{il}^n$, which depends on classes 
$\omega_i$, $\omega_l$ and the number of features, the errors caused by truncations 
can be constrained within any given error limit $\varepsilon_0$. 

Figure 2.5 shows the flowchart of the multistage classifier where 
truncation is done by the differences of log likelihood values. In this example, 
the number of features is increased by the same amount from stage to stage. 


2.3.4 Upper Bound on the Total Incremental Error Probability of Multistage 
Classifier 


Assume that there are M classes and N stages without counting the final 
stage. The total error increment, $\varepsilon_{incre}$, caused by the truncations 
can be viewed as the accumulation of truncation error at each stage and can be 
formulated as 

$$\varepsilon_{incre} = \sum_{i=1}^{M} P(\omega_i) \sum_{j=1}^{N} T_{ij} \tag{2.10}$$

where 

$T_{ij}$ : Probability that class $\omega_i$ is truncated at the jth stage. 

M : The number of classes. 

N : The number of stages without counting the final stage. 

Figure 2.5 Flowchart of the multistage classifier where the truncation is 
done by the differences of log likelihood values. 


If the truncation is done by absolute regions and $P_t$ is the threshold probability,
$T_{ij}$ does not depend on the number of classes and is given by

$$T_{ij} = (1-P_t)R_{ij}$$

where $R_{ij}$ is the probability that the samples of class $\omega_i$ which are
truncated at the $j$th stage have not been truncated until the $(j-1)$th
stage.

From the definition of $R_{ij}$, it can be easily seen that $R_{i1}$ is 1. In the desirable case,
$R_{ij}$ would be zero except $R_{i1}$, i.e., all classes to be truncated are truncated at the
first stage. Then the total error increment is given by

$$\varepsilon_{incre} = \sum_{i=1}^{M} P(\omega_i) \sum_{j=1}^{N} (1-P_t)R_{ij} = (1-P_t)\sum_{i=1}^{M} P(\omega_i) = 1-P_t$$


In the worst case, $R_{ij}$ would be 1. In other words, the truncation error of class $\omega_i$
at each stage is accumulated without any overlap. Then the total error
increment is given by

$$\varepsilon_{incre} = \sum_{i=1}^{M} P(\omega_i) \sum_{j=1}^{N} (1-P_t)R_{ij} = N(1-P_t)\sum_{i=1}^{M} P(\omega_i) = N(1-P_t)$$

Thus even in the worst case, which is very unlikely, the total error increment is
bounded by

$$\varepsilon_{incre} \le N(1-P_t)$$

A typical number of intermediate stages in a multistage classifier would
be 3 to 5. Thus by carefully choosing $P_t$, the total error
increment due to truncation can be constrained within any specified range while
achieving a substantial reduction in processing time. In practice, the average
value of $R_{ij}$ would be much less than 1. In addition, since most of the samples
truncated in intermediate stages would be misclassified at the final stage anyway, the
actual error increment due to truncation would be much smaller.

If the truncation is done by likelihood ratio and $\varepsilon_0$ is an error limit, $T_{ij}$ depends on
the number of classes and can be formulated as

$$T_{ij} = R_{ij} \sum_{k=1,\,k\ne i}^{M} \varepsilon_0\, q_{ij}^{k}$$

where $R_{ij}$ is the probability that the samples of class $\omega_i$ which are
truncated at the $j$th stage have not been truncated until the $(j-1)$th stage, and
$q_{ij}^{k}$ is the probability that the samples of class $\omega_i$ which are
truncated by class $\omega_k$ at the $j$th stage have not been truncated by other
classes at the $j$th stage.

In the worst case, which is very unlikely, $q_{ij}^{k}$ and $R_{ij}$ would be 1. In other words,
the truncation errors of class $\omega_i$ by the other classes are accumulated without any
overlap at the $j$th stage, and the truncation error of class $\omega_i$ at each stage is also
accumulated without any overlap. In the worst case, the total error increment is
bounded by

$$\varepsilon_{incre} = \sum_{i=1}^{M} P(\omega_i) \sum_{j=1}^{N} R_{ij} \sum_{k=1,\,k\ne i}^{M} \varepsilon_0\, q_{ij}^{k} \le (M-1)N\varepsilon_0$$

Therefore, even in the worst case, the total error
increment due to truncation can be constrained within any specified range by
adjusting $\varepsilon_0$. In addition, it is observed that even a significant difference in $\varepsilon_0$
results in a minor difference in computing time. Moreover, in real data, the
average values of $q_{ij}^{k}$ and $R_{ij}$ would be much less than 1, though the values
depend on the characteristics of data. In addition, since most of the samples
truncated in intermediate stages would be misclassified at the final stage anyway, the
actual error increment due to truncation would be much smaller. Thus by
carefully choosing $\varepsilon_0$, the total error increment due to
truncation can be constrained within any specified range while achieving a
substantial reduction in processing time.
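
The two worst-case bounds of this section, N(1 - P_t) for truncation by absolute region and (M - 1)N times the error limit for truncation by likelihood ratio, are simple enough to evaluate directly. The sketch below is illustrative only; the function names are hypothetical, and the example values (12 classes, 3 intermediate stages) are chosen to resemble the experiments reported next.

```python
def bound_absolute_region(n_stages, p_threshold):
    """Worst-case error increment N(1 - P_t) for absolute-region truncation."""
    return n_stages * (1.0 - p_threshold)


def bound_likelihood_ratio(n_classes, n_stages, error_limit):
    """Worst-case error increment (M - 1) N eps_0 for likelihood-ratio truncation."""
    return (n_classes - 1) * n_stages * error_limit


def error_limit_for_target(n_classes, n_stages, target):
    """Largest eps_0 whose worst-case bound does not exceed `target`."""
    return target / ((n_classes - 1) * n_stages)


if __name__ == "__main__":
    # 3 intermediate stages, threshold probability 99.9%: bound of about 0.003.
    print(bound_absolute_region(3, 0.999))
    # 12 classes, 3 intermediate stages, eps_0 = 0.001: bound of about 0.033.
    print(bound_likelihood_ratio(12, 3, 0.001))
```

Because the likelihood-ratio bound grows with the number of classes, it becomes loose for many-class problems; as noted above, the average overlap factors are expected to be well below 1 in practice, so the bound is a guarantee rather than a prediction.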


2.4 Experiments and Results 

Tests were conducted using FSS (Field Spectrometer System) data,
which have 60 spectral bands (Biehl et al., 1982). The major parameters of FSS
are shown in Table 2.1. The data are multi-spectral and multi-temporal.

Table 2.1 Parameters of Field Spectrometer System.

Number of Bands      60
Spectral Coverage    0.4 - 2.4 µm
Altitude             60 m
IFOV (ground)        25 m

Figure 2.6 shows the relationship between accuracy and the number of 
features for a 40-class classification using a conventional Gaussian ML 
classifier. A total of 13,033 data points were used. Half of the data were used for 
training and the other half were used for test. From Figure 2.6, it can be seen 
that accuracies increase as the number of features increases. Though this 
demonstrates clearly the discriminating power of high dimensional data, the 
computation cost is also high. The proposed multistage classifier can be 
successfully employed in such circumstances, in particular for high dimensional 
and numerous class cases. 



Figure 2.6 Accuracy vs. number of features in a 40-class classification problem.

Two tests were conducted to evaluate the performance of the proposed
algorithm. The machine used was a CCI 3/32. The numbers of classes were 12
and 40, and the numbers of data points were 6668 and 13,033, respectively. Half of the
data was again used for training and the other half for test. The number of
features was reduced to 28 and 26, respectively, using the algorithm proposed
by Chen and Landgrebe (1989). Six classifiers were tested for each data set in
order to evaluate the performance of the proposed multistage classifiers. The
first one was the conventional single stage Gaussian ML classifier. The next two
were multistage classifiers where truncation was done by absolute region;
two threshold probabilities, 99.9% and 99%, were tested. The last three were
multistage classifiers where truncation was done by the difference of the
discriminant function values for $\varepsilon_0$ = 0.001, 0.005 and 0.01. The number of stages
of the tested multistage classifiers was 4 in all cases, and the numbers of features
used at the first three stages were 5, 10 and 15, respectively. The whole feature set
was used at the final stage.

Figure 2.7 shows the performance comparison for the case of 12 classes.
The computing time of the conventional single stage Gaussian ML classifier,
C1, was 117 seconds with an accuracy of 95.2%. The computing time of the
multistage classifier, C2, where truncations were done by absolute region
with the threshold probability $P_t$ = 99.9%, was about 31 seconds with an
accuracy of 94%; the computing time of the multistage classifier, C3, with the
threshold probability $P_t$ = 99%, was 25 seconds with an accuracy of 92.7%.
Comparing classifier C1 with classifiers C2 and C3, the processing times of
multistage classifiers C2 and C3 were 21-27% of that of the single stage
classifier C1, with error increased by 1.2% and 2.5%, respectively.

Figure 2.7 Classifier performance comparison for the 12-class case (accuracy and processing time).


The classifiers are as follows:

C1 Single Stage Gaussian ML Classifier.

C2 Multistage Classifier. Truncation by absolute region. $P_t$ = 99.9%.

C3 Multistage Classifier. Truncation by absolute region. $P_t$ = 99%.

C4 Multistage Classifier. Truncation by the difference of the
discriminant function values. $\varepsilon_0$ = 0.001.

C5 Multistage Classifier. Truncation by the difference of the
discriminant function values. $\varepsilon_0$ = 0.005.

C6 Multistage Classifier. Truncation by the difference of the
discriminant function values. $\varepsilon_0$ = 0.01.


On the other hand, the computing times of the multistage classifiers
where truncation was done by the difference of the discriminant function values
for $\varepsilon_0$ = 0.001, 0.005 and 0.01 were 21, 18 and 17 seconds, respectively, with
accuracies of 94.7%, 94% and 93.2%. It is observed that the processing times
were reduced by a factor of 5.6 to 6.9 while errors increased by 0.5%, 1.2%
and 2%, respectively. Table 2.2 shows accuracies and error increments for
individual classes due to the truncation. It can be seen that the error increments
due to the truncation are evenly distributed and no particular class is sacrificed.
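
The speed-up factors and error increases quoted above follow directly from the reported times and accuracies. The short check below reproduces the arithmetic for the 12-class case; the dictionary layout and the function name are illustrative.

```python
# Reported 12-class results: (computing time in seconds, accuracy in %).
results = {
    "C1": (117, 95.2),  # single stage Gaussian ML classifier
    "C4": (21, 94.7),
    "C5": (18, 94.0),
    "C6": (17, 93.2),
}
base_time, base_accuracy = results["C1"]


def speedup_and_error_increase(name):
    """Speed-up factor and error increase of a classifier relative to C1."""
    time, accuracy = results[name]
    return round(base_time / time, 1), round(base_accuracy - accuracy, 1)


if __name__ == "__main__":
    for name in ("C4", "C5", "C6"):
        print(name, speedup_and_error_increase(name))
```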

Table 2.2 Accuracies and error increments for individual classes
due to truncation. Minus signs in the "Error Incre." rows
indicate that accuracies increased.

                   Cl.1   Cl.2   Cl.3   Cl.4   Cl.5   Cl.6   Cl.7   Cl.8   Cl.9   Cl.10  Cl.11  Cl.12  Ave.
C1  Accuracy       97.4   99.1   96.7   96.7   96.3   72.6   98.1   99.2   91.0   98.3   98.7   97.3   95.2
C2  Accuracy       95.4   97.6   96.1   95.1   96.6   73.7   96.1   96.6   89.7   96.1   97.4   ...    94.0
    Error Incre.   ...    1.5    ...    1.5    ...    -1.2   1.9    ...    1.3    ...    1.3    ...    1.2
C3  Accuracy       ...    ...    93.7   93.3   95.3   76.8   94.6   95.4   ...    ...    ...    ...    92.7
    Error Incre.   ...    ...    ...    ...    ...    ...    3.5    3.8    ...    ...    ...    ...    2.5
C4  Accuracy       96.2   98.2   95.8   96.4   96.3   77.2   96.1   97.9   89.7   ...    ...    ...    94.7
    Error Incre.   1.2    ...    ...    ...    ...    ...    1.9    1.3    1.3    ...    ...    ...    0.5
C5  Accuracy       95.9   97.0   93.7   93.3   96.3   82.2   94.6   97.5   88.0   96.1   97.8   93.6   94.0
    Error Incre.   1.4    2.1    ...    ...    ...    ...    3.5    1.7    3.0    ...    ...    ...    1.2
C6  Accuracy       95.1   97.0   92.7   90.0   96.6   85.3   91.9   96.2   87.6   95.7   97.8   91.8   93.2
    Error Incre.   ...    2.1    ...    ...    ...    -12.7  6.2    3.0    3.4    2.6    ...    ...    2.0


Figure 2.8 shows the performance comparison for the classification with
40 classes. The computing time of the conventional single stage Gaussian ML
classifier, C1, was 655 seconds with an accuracy of 79.4%, while the computing
times of the multistage classifiers C2 and C3 were 188 and 155 seconds with
accuracies of 78.3% and 76.8%, respectively. Comparing classifier C1 with
multistage classifiers C2 and C3, the processing times of multistage classifiers
C2 and C3 were 29% and 24% of that of the single stage classifier C1, with
error increases of 1.1% and 2.6%, respectively.

On the other hand, the computing times of multistage classifiers C4, C5
and C6 were 123, 103 and 93 seconds with accuracies of 78.7%, 77.9% and
77.4%, respectively. It is observed that the processing times were reduced by
a factor of 5.3 to 7.0 while errors increased by 0.7%, 1.5% and 2%, respectively.
In particular, comparing C1 and C4, the processing time for 40 classes was
reduced from 652 seconds to 123 seconds, a factor of more than 5, while the
accuracy decreased from 79.4% to 78.7%.

In most applications, such error increments due to truncations would be
acceptable. It is also observed that most of the test samples which caused
truncation error are found at boundaries and may be truncated if a chi threshold
is applied. In other words, the results for such test samples are neither reliable nor
critical. In addition, the error tolerance can be adjusted depending on the
requirement of the application.



Figure 2.8 Classifier performance comparison for the 40-class case. 


2.5 Conclusion 

It is shown that the computing time can be reduced by a factor of 3 to 7
using the proposed multistage classification while maintaining essentially the
same accuracies when the Gaussian ML classifier is used. Although the
proposed algorithm was developed on the assumption of a Gaussian ML
classifier, the relationships between the thresholds and the error increments were
derived without the assumption of Gaussian ML classification. Thus the
proposed algorithm can be used with other classification algorithms if an
algorithm to avoid the repeated calculation is developed. Therefore, after
selecting features according to an accuracy requirement, the processing
time could be reduced substantially without losing any significant accuracy by
employing the multistage classifiers, particularly for high dimensional data.

CHAPTER 3 DECISION BOUNDARY FEATURE EXTRACTION


3.1 Introduction 

Linear feature extraction can be viewed as finding a set of vectors that 
represent an observation while reducing the dimensionality. In pattern 
recognition, it is desirable to extract features that are focused on discriminating 
between classes. Although a reduction in dimensionality is desirable, the error 
increment due to the reduction in dimensionality must be constrained to be 
adequately small. Finding the minimum number of feature vectors which 
represent observations with reduced dimensionality without sacrificing the 
discriminating power of classifiers along with finding the specific feature vectors 
has been one of the most important problems of the field of pattern analysis and 
has been studied extensively. 

In this chapter, we address this problem and propose a new algorithm for 
feature extraction based directly on the decision boundary. The algorithm 
predicts the minimum number of features to achieve the same classification 
accuracy as in the original space; at the same time the algorithm finds the 
needed feature vectors. Noting that feature extraction can be viewed as 
retaining informative features or eliminating redundant features, we define the 
terms "discriminantly informative" feature and "discriminantly redundant" 
feature. This reduces feature extraction to finding discriminantly informative 
features. We will show how discriminantly informative features and 
discriminantly redundant features are related to the decision boundary and can 
be derived from the decision boundary. We will need to define several terms 
and derive several theorems and, based on the theorems, propose a procedure 
to find discriminantly informative features from the decision boundary. 

3.2 Background and Previous Work

Most linear feature extraction algorithms can be viewed as linear 
transformations. One of the most widely used transforms for signal 
representation is the Karhunen-Loeve transformation. Although the Karhunen- 
Loeve transformation is optimum for signal representation in the sense that it 
provides the smallest mean square error for a given number of features, quite 
often the features defined by the Karhunen-Loeve transformation are not 
optimum with regard to class separability (Malina 1987). In feature extraction for 
classification, it is not the mean square error but the classification accuracy that 
must be considered as the primary criterion for feature extraction. 

Many authors have attempted to find the best features for classification 
based on criterion functions. Fisher's method finds the vector that gives the 
greatest class separation as defined by a criterion function (Duda and Hart 
1973). Fisher's linear discriminant can be generalized to multiclass problems. In 
canonical analysis (Richards 1986), a within-class scatter matrix $\Sigma_w$ and a
between-class scatter matrix $\Sigma_b$ are used to formulate a criterion function and a
vector $d$ is selected to maximize

$$\frac{d^t \Sigma_b d}{d^t \Sigma_w d}$$

where

$$\Sigma_w = \sum_i P(\omega_i)\Sigma_i \quad \text{(within-class scatter matrix)}$$

$$\Sigma_b = \sum_i P(\omega_i)(M_i - M_0)(M_i - M_0)^t \quad \text{(between-class scatter matrix)}$$

$$M_0 = \sum_i P(\omega_i)M_i$$

Here $M_i$, $\Sigma_i$, and $P(\omega_i)$ are the mean vector, the covariance matrix, and the prior
probability of class $\omega_i$, respectively. Although the vector found by canonical
analysis performs well in most cases, there are several problems with canonical 
analysis. First of all, if there is little or no difference in mean vectors, the feature 
vector selected by canonical analysis is not reliable. Second, if a class has a 
mean vector very different from the mean vectors of the other classes, that class 
will be dominant in calculating the between-class scatter matrix, thus resulting 
in ineffective feature extraction. 
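
The canonical analysis criterion can be illustrated numerically. The sketch below assumes the class means, covariance matrices and prior probabilities are already known, and finds the maximizing vector as the leading eigenvector of the product of the inverse within-class scatter matrix and the between-class scatter matrix, which is the standard way to maximize a ratio of this form. The function name `canonical_direction` is illustrative.

```python
import numpy as np


def canonical_direction(means, covs, priors):
    """Unit vector d maximizing (d^t S_b d) / (d^t S_w d)."""
    means = np.asarray(means, dtype=float)    # shape (classes, N)
    covs = np.asarray(covs, dtype=float)      # shape (classes, N, N)
    priors = np.asarray(priors, dtype=float)  # shape (classes,)

    m0 = priors @ means                                   # overall mean M_0
    s_w = np.einsum("i,ijk->jk", priors, covs)            # within-class scatter
    diffs = means - m0
    s_b = np.einsum("i,ij,ik->jk", priors, diffs, diffs)  # between-class scatter

    # Leading eigenvector of S_w^{-1} S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(s_w, s_b))
    d = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return d / np.linalg.norm(d)
```

If all class means coincide, the between-class scatter matrix is zero and every direction scores equally, which is the first weakness noted above.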

Fukunaga recognized that the best representational features are not
necessarily the best discriminating features and proposed a preliminary
transformation (Fukunaga and Koontz 1970). The Fukunaga-Koontz method
first finds a transformation matrix $T$ such that

$$T[S_1 + S_2]T^t = I$$

where $S_i$ is the autocorrelation matrix of class $\omega_i$.

Fukunaga showed that $TS_1T^t$ and $TS_2T^t$ have the same eigenvectors and all
the eigenvalues are bounded by 0 and 1. It can be seen that the eigenvector
with the largest differences in eigenvalues is the axis with the largest
differences in variances. The Fukunaga-Koontz method will work well in
problems where the covariance difference is dominant with little or no mean
difference. However, by ignoring the information of mean difference, the
Fukunaga-Koontz method is not suitable in the general case and could lead to
irrelevant results (Foley and Sammon 1975).
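
A small numerical sketch of the Fukunaga-Koontz transformation follows. It assumes the two autocorrelation matrices are given and their sum is positive definite, and it builds the whitening matrix from the eigendecomposition of the sum, which is one common construction and not necessarily the one used in the cited paper. The function name is illustrative.

```python
import numpy as np


def fukunaga_koontz(s1, s2):
    """Return T with T (S1 + S2) T^t = I, plus the shared eigen-structure.

    eigvals1 are the eigenvalues of T S1 T^t; those of T S2 T^t are
    1 - eigvals1, with the same eigenvectors (columns of `vecs`).
    """
    lam, phi = np.linalg.eigh(s1 + s2)      # S1 + S2 = phi diag(lam) phi^t
    t = np.diag(lam ** -0.5) @ phi.T        # whitening transformation
    eigvals1, vecs = np.linalg.eigh(t @ s1 @ t.T)
    return t, eigvals1, vecs
```

The eigenvector whose eigenvalue is closest to 1 carries most of the energy of class 1 and least of class 2, and vice versa for eigenvalues close to 0.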

Kazakos proposed a linear scalar feature extraction algorithm that 
minimizes the probability of error in discriminating between two multivariate 
normally distributed pattern classes (Kazakos 1978). By directly employing the 
probability of error, the feature extraction method finds the best single feature 
vector in the sense that it gives the smallest error. However, if more than one 
feature is necessary, it is difficult to generalize the method. 

Heydorn proposed a feature extraction method by deleting redundant 
features where redundancy is defined in terms of a marginal distribution 
function (Heydorn 1971). The redundancy test uses a coefficient of redundancy. 
However, the method does not find a redundant feature vector unless the vector 
is in the direction of one of the original feature vectors even though the 
redundant feature vector could be detected by a linear transformation. 

Decell et al. developed an explicit expression for the smallest
compression matrix such that the Bayes classification regions are preserved
(Decell et al. 1981). Young et al. extended the method to a general class of
density functions known as $\theta$-generalized normal densities (Young et al. 1985),
and Tubbs et al. discussed the problem of unknown population parameters
(Tubbs et al. 1982).

Feature selection using statistical distance measures has also been 
widely studied and successfully applied [(Swain and King 1973), (Swain and 
Davis 1978), and (Kailath 1967)]. However, as the dimension of data increases, 
the combination of bands to be examined increases exponentially, resulting in 
unacceptable computational cost. Several procedures to find a sub-optimum 
combination of bands instead of the optimum combination of bands have been 
proposed with a reasonable computational cost (Devijver and Kittler 1982). 
However, if the best feature vector or the best set of feature vectors is not in the 
direction of any original feature vector, more features may be needed to achieve 
the same performance. 

Depending on the characteristics of the data, it has been shown that the 
previous feature extraction/selection methods can be applied successfully. 
However, it is also true that there are some cases in which the previous 
methods fail to find the best feature vectors or even good feature vectors, thus 
resulting in difficulty in choosing a suitable method to solve a particular 
problem. Although some authors addressed this problem [(Malina 1981) and 
(Longstaff 1987)], there is still another problem. One must determine, for a given 
problem, how many features must be selected to meet the requirement. More 
fundamentally, it is difficult with the previous feature extraction/selection 
algorithms to predict the intrinsic discriminant dimensionality, which is defined 
as the smallest number of features needed to achieve the same classification 
accuracy as in the original space for a given problem. 

In this chapter, we propose a different approach to the problem of feature 
extraction for classification. The proposed algorithm is based on decision 
boundaries directly. The proposed algorithm predicts the minimum number of 
features needed to achieve the same classification accuracy as in the original 
space for a given problem and finds the needed feature vectors, and it does not 
deteriorate when mean or covariance differences are small. 

3.3 Feature Extraction and Subspace

3.3.1 Feature Extraction and Subspace 

Let $X$ be an observation in the N-dimensional Euclidean space $E^N$. Then
$X$ can be represented by

$$X = \sum_{i=1}^{N} a_i\alpha_i \quad \text{where } \{\alpha_1, \alpha_2, \ldots, \alpha_N\} \text{ is a basis of } E^N$$

Then feature extraction is equivalent to finding a subspace, $W$, and the new
features can be found by projecting an observation into the subspace. Let $W$ be
an M-dimensional subspace of $E^N$ spanned by $M$ linearly independent vectors,
$\beta_1, \beta_2, \ldots, \beta_M$:

$$W = \mathrm{Span}\{\beta_1, \beta_2, \ldots, \beta_M\} \quad \text{and} \quad \dim(W) = M \le N$$

Assuming that the $\beta_i$'s are orthonormal, the new feature set in subspace $W$ is given
by

$$\{X^t\beta_1, X^t\beta_2, \ldots, X^t\beta_M\} = \{b_1, b_2, \ldots, b_M\} \quad \text{where } b_i = X^t\beta_i$$

Now let $\hat{X} = \sum_{i=1}^{M} b_i\beta_i$. Then $\hat{X}$ will be an approximation to $X$ in terms of a linear
combination of $\{\beta_1, \beta_2, \ldots, \beta_M\}$ in the original N-dimensional space.
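
The projection described above amounts to two matrix products. The following sketch assumes the spanning vectors are orthonormal and stored as the columns of an N x M matrix; the function name `extract_features` is illustrative.

```python
import numpy as np


def extract_features(x, basis):
    """Project x onto the subspace spanned by the columns of `basis`.

    basis -- N x M matrix whose columns are the orthonormal vectors beta_i.
    Returns the new features b_i = X^t beta_i and the approximation of X
    expressed in the original N-dimensional space.
    """
    b = basis.T @ x      # new feature set {b_1, ..., b_M}
    x_hat = basis @ b    # sum of b_i * beta_i
    return b, x_hat
```

For example, with a three-dimensional observation and the first two coordinate axes as the basis, the features are the first two coordinates and the approximation has a zero third coordinate.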


3.3.2 Bayes’ Decision Rule for Minimum Error 

Now consider briefly Bayes' decision rule for minimum error, which will
be used later in the proposed feature extraction algorithm. Let $X$ be an
observation in the N-dimensional Euclidean space $E^N$ under hypothesis $H_i$: $X \in \omega_i$,
$i = 1, 2$. Decisions will be made according to the following rule:

Decide $\omega_1$ if $P(\omega_1)P(X|\omega_1) > P(\omega_2)P(X|\omega_2)$,
else $\omega_2$

where $P(X|\omega_i)$ is a conditional density function and $P(\omega_i)$ is the a priori probability of
class $\omega_i$.

Let $h(X) = -\ln\dfrac{P(X|\omega_1)}{P(X|\omega_2)}$ and $t = \ln\dfrac{P(\omega_1)}{P(\omega_2)}$. Then

Decide $\omega_1$ if $h(X) < t$,
else $\omega_2$
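
As an illustration of the rule, the sketch below evaluates h(X) and t for two Gaussian classes. The Gaussian form of the class densities is an assumption made for the example (the rule itself holds for any densities), and the function names are illustrative.

```python
import numpy as np


def log_gaussian(x, mean, cov):
    """Log of the Gaussian density N(mean, cov) evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + maha)


def h(x, m1, c1, m2, c2):
    """h(X) = -ln[ P(X|w1) / P(X|w2) ]."""
    return -(log_gaussian(x, m1, c1) - log_gaussian(x, m2, c2))


def decide(x, m1, c1, m2, c2, p1, p2):
    """Return 1 if h(X) < t with t = ln(P(w1)/P(w2)), else 2."""
    t = np.log(p1 / p2)
    return 1 if h(x, m1, c1, m2, c2) < t else 2
```

Raising the prior probability of class 1 raises t and therefore enlarges the region assigned to class 1.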

Feature extraction has been used in many applications, and the criteria
for feature extraction can be different in each case. If feature extraction is
directed specifically at classification, a criterion could be to maintain
classification accuracy. As a new approach to feature extraction for
classification, we will find a subspace, $W$, with the minimum dimension $M$ and
the spanning vectors $\{\beta_i\}$ of the subspace such that for any observation $X$

$$(h(X) - t)(h(\hat{X}) - t) > 0 \tag{3.1}$$

where $\hat{X}$ is an approximation of $X$ in terms of a basis of subspace $W$ in the
original N-dimensional space. The physical meaning of (3.1) is that the
classification result for $\hat{X}$ is the same as the classification result of $X$. In practice,
feature vectors might be selected in such a way as to maximize the number of
observations for which (3.1) holds with a constraint on the dimensionality of
subspaces. In this chapter, we will propose an algorithm which finds the
minimum dimension of a subspace such that (3.1) holds for all the given
observations and which also finds the spanning vectors $\{\beta_i\}$ of the subspace. In
the next section, we define some needed terminology which will be used in
deriving theorems later.


3.4 Definitions 

3.4.1 Discriminantly Redundant Feature 

Feature extraction can be performed by eliminating redundant features;
however, what is meant by "redundant" may be dependent on the application.
For the purpose of feature extraction for classification, we will define a
"discriminantly redundant feature" as follows.

Definition 3.1 We say the vector $\beta_k$ is discriminantly redundant if for any
observation $X$

$$(h(X) - t)(h(\hat{X}) - t) > 0 \tag{3.1}$$

In other words,

if $h(X) > t$, then $h(\hat{X}) > t$, or
if $h(X) < t$, then $h(\hat{X}) < t$

where $X = \sum_{i=1}^{N} b_i\beta_i$ and $\hat{X} = \sum_{i=1,\, i \ne k}^{N} b_i\beta_i$.

The physical meaning of (3.1) is that the classification result for $\hat{X}$ is the same
as the classification result of $X$. Figure 3.1 shows an example of a discriminantly
redundant feature. In this case even though $X$ is moved along the direction of
vector $\beta_k$, the classification result will remain unchanged. This means vector $\beta_k$
makes no contribution in discriminating between classes, thus vector $\beta_k$ is
redundant for the purpose of classification.



Figure 3.1 An example of a discriminantly redundant feature. 

3.4.2 Discriminantly Informative Feature

In a similar manner, we define a discriminantly informative feature.

Definition 3.2 We say that $\beta_k$ is discriminantly informative if there exists at
least one observation $Y$ such that

$$(h(Y) - t)(h(\hat{Y}) - t) < 0 \tag{3.2}$$

In other words,

$h(Y) > t$ but $h(\hat{Y}) < t$, or
$h(Y) < t$ but $h(\hat{Y}) > t$

where $Y = \sum_{i=1}^{N} b_i\beta_i$ and $\hat{Y} = \sum_{i=1,\, i \ne k}^{N} b_i\beta_i$.

The physical meaning of (3.2) is that there exists an observation $Y$ such that the
classification result of $\hat{Y}$ is different from the classification result of $Y$. It is noted
that (3.2) need not hold for all observations. A vector will be discriminantly
informative if there exists at least one observation whose classification result
can be changed as the observation moves along the direction of the vector.
Figure 3.2 shows an example of a discriminantly informative feature. In this
case, as $Y$ is moved along the direction of vector $\beta_k$, the classification result will
be changed.
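
Definitions 3.1 and 3.2 can be checked numerically on a toy problem. The sketch below assumes two Gaussian classes with identity covariance matrices, equal priors (so t = 0) and hypothetical means (3, 3) and (1, 1); none of these values come from the text. It removes the component of each sample along a unit vector and tests whether condition (3.2) holds for at least one sample.

```python
import numpy as np

M1 = np.array([3.0, 3.0])   # hypothetical mean of class 1
M2 = np.array([1.0, 1.0])   # hypothetical mean of class 2
T = 0.0                     # equal priors, so t = ln(1) = 0


def h(x):
    """h(X) for two Gaussian classes with identity covariance matrices."""
    return 0.5 * (np.sum((x - M1) ** 2) - np.sum((x - M2) ** 2))


def changes_decision(samples, beta_k):
    """True if dropping the component along the unit vector beta_k changes
    the classification of at least one sample, i.e. (3.2) holds."""
    for x in samples:
        x_hat = x - (x @ beta_k) * beta_k   # X with the beta_k term removed
        if (h(x) - T) * (h(x_hat) - T) < 0:
            return True
    return False


samples = [np.array(p) for p in [(3.0, 3.0), (1.0, 1.0), (4.0, 2.0), (0.0, 2.0)]]
along_means = np.array([1.0, 1.0]) / np.sqrt(2.0)     # normal to the boundary
across_means = np.array([1.0, -1.0]) / np.sqrt(2.0)   # parallel to the boundary
```

In this setting the decision boundary is the line x1 + x2 = 4, so the direction parallel to it never changes a decision while the direction normal to it does.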



Figure 3.2 An example of a discriminantly informative feature. 




3.4.3 Decision Boundaries and Effective Decision Boundaries 

The decision boundary of a two-class problem is a locus of points on 
which a posteriori probabilities are the same. To be more precise, we define a 
decision boundary as follows: 

Definition 3.3 A decision boundary is defined as 

{ X | h(X) = t } 

A decision boundary can be a point, line, curved surface or curved
hyper-surface. Although a decision boundary can be extended to infinity, in most
cases some portion of the decision boundary is not significant. For practical
purposes, we define the effective decision boundary as follows.

Definition 3.4 The effective decision boundary is defined as

$$\{ X \mid h(X) = t,\ X \in R_1 \text{ or } X \in R_2 \}$$

where $R_1$ is the smallest region which contains a certain portion, $P_{threshold}$, of
class $\omega_1$ and $R_2$ is the smallest region which contains a certain portion,
$P_{threshold}$, of class $\omega_2$.

The effective decision boundary may be seen as an intersection of the decision
boundary and the regions where most of the data are located. Figures 3.3 and
3.4 show some examples of decision boundaries and effective decision
boundaries. In these examples, the threshold probability, $P_{threshold}$, is set to
99.9%. In the case of Figure 3.3, the decision boundary is a straight line and the
effective decision boundary is a straight line segment, the latter being a part of
the former. In Figure 3.4, the decision boundary is an ellipse and the effective
decision boundary is a part of that ellipse.

Figure 3.3 $\Sigma_1 = \Sigma_2$. The decision boundary is a straight line and the
effective decision boundary is a line segment coincident to it.



Figure 3.4 $M_1 \ne M_2$, $\Sigma_1 \ne \Sigma_2$. The decision boundary and the effective decision
boundary.

3.4.4 Intrinsic Discriminant Dimension 

One of the major problems of feature extraction for classification is to find 
the minimum number of features needed to achieve the same classification 
accuracy as in the original space. To be more exact, we define the term, 
"intrinsic discriminant dimension". 

Definition 3.5 The intrinsic discriminant dimension for a given problem is
defined as the smallest dimension of a subspace, $W$, of the
N-dimensional Euclidean space $E^N$ such that for any observation $X$ in the
problem,

$$(h(X) - t)(h(\hat{X}) - t) > 0$$

where $\hat{X} = \sum_{i=1}^{M} a_i\beta_i \in W$ and $M \le N$.

The intrinsic discriminant dimension can be seen as the smallest dimensional 
subspace wherein the same classification accuracy can be obtained as could 
be obtained in the original space. 

The intrinsic discriminant dimension is related to the discriminantly 
redundant feature vector and the discriminantly informative feature vector. In 
particular, if there are M linearly independent discriminantly informative feature 
vectors and L linearly independent discriminantly redundant feature vectors, 
then it can be easily seen that 

N= M + L 

where N is the original dimension and the intrinsic discriminant dimension is 
equal to M. Figure 3.5 shows an example of the intrinsic discriminant 
dimension. In the case of Figure 3.5, the intrinsic discriminant dimension is one 
even though the original dimensionality is two. If $V_2$ is chosen as a new feature
vector, the classification accuracy will be the same as in the original
2-dimensional space.



Figure 3.5 $\Sigma_1 = \Sigma_2$. In this case the intrinsic discriminant dimension is one
even though the original space is two dimensional.

3.5 Feature Extraction Based on the Decision Boundary

3.5.1 Redundancy Testing Theorem 

From the definitions given in the previous section, a useful theorem can 
be stated which tests whether a feature vector is a discriminantly redundant 
feature or a discriminantly informative feature. 

Theorem 3.1 If a vector is parallel to the tangent hyper-plane to the decision 
boundary at every point on the decision boundary for a pattern 
classification problem, the vector contains no information useful in 
discriminating between classes for the pattern classification problem, 
i.e., the vector is discriminantly redundant. 

Proof. Let $\{\beta_1, \beta_2, \ldots, \beta_N\}$ be a basis of the N-dimensional Euclidean space $E^N$, and
let $\beta_N$ be a vector that is parallel to the tangent hyper-plane to the decision
boundary at every point on the decision boundary. Let $W$ be a subspace
spanned by the $N-1$ spanning vectors $\beta_1, \beta_2, \ldots, \beta_{N-1}$, i.e.,

$$W = \mathrm{Span}\{\beta_1, \beta_2, \ldots, \beta_{N-1}\} \quad \text{and} \quad \dim(W) = N-1$$

If $\beta_N$ is not a discriminantly redundant feature, there must exist an observation $X$
such that

$$(h(X) - t)(h(\hat{X}) - t) < 0$$

where $X = \sum_{i=1}^{N} b_i\beta_i$ and $\hat{X} = \sum_{i=1}^{N-1} c_i\beta_i$.

Without loss of generality, we can assume that the set of vectors $\beta_1, \beta_2, \ldots, \beta_N$ is an
orthonormal set. Then $b_i = c_i$ for $i = 1, \ldots, N-1$. Assume that there is an observation $X$
such that

$$(h(X) - t)(h(\hat{X}) - t) < 0$$

This means $X$ and $\hat{X}$ are on different sides of the decision boundary. Then the
vector

$$X_d = X - \hat{X} = b_N\beta_N$$

where $b_N$ is a coefficient, must pass through the decision boundary. But this
contradicts the assumption that $\beta_N$ is parallel to the tangent hyper-plane to the
decision boundary at every point on the decision boundary. Therefore if $\beta_N$ is a
vector parallel to the tangent hyper-plane to the decision boundary at every
point on the decision boundary, then for all observations $X$

$$(h(X) - t)(h(\hat{X}) - t) > 0$$

Therefore $\beta_N$ is discriminantly redundant. Figure 3.6 shows an illustration of the
proof.



Figure 3.6 If two observations are on the different sides of the decision 
boundary, the line connecting the two observations will pass 
through the decision boundary. 


It is noted that we did not make any assumption on the number of classes 
in proving Theorem 3.1. In other words, Theorem 3.1 holds for any number of 
classes. From the theorem, we can easily derive the following lemmas which 
are very useful in finding discriminantly informative features. 

Lemma 3.1 If vector V is orthogonal to the vector normal to the decision 
boundary at every point on the decision boundary, vector V contains no 
information useful in discriminating between classes, i.e., vector V is
discriminantly redundant. 

Lemma 3.2 If a vector is normal to the decision boundary at at least one 
point on the decision boundary, the vector contains information useful in 
discriminating between classes, i.e., the vector is discriminantly 
informative. 

3.5.2 Decision Boundary Feature Matrix 

From the previous theorem and lemmas, it can be seen that a vector 
normal to the decision boundary at a point is a discriminantly informative 
feature, and the effectiveness of the vector is roughly proportional to the area of 
the decision boundary which has the same normal vector. Now we can define a 
DECISION BOUNDARY FEATURE MATRIX which is very useful to predict the 
intrinsic discriminant dimension and find the necessary feature vectors. 

Definition 3.6 The decision boundary feature matrix (DBFM): Let $N(X)$ be the
unit normal vector to the decision boundary at a point $X$ on the decision
boundary for a given pattern classification problem. Then the decision
boundary feature matrix $\Sigma_{DBFM}$ is defined as

$$\Sigma_{DBFM} = \frac{1}{K}\int_S N(X)N^t(X)\,p(X)\,dX$$

where $p(X)$ is a probability density function, $K = \int_S p(X)\,dX$, $S$ is the
decision boundary, and the integral is performed over the decision boundary.

We will show some examples of the decision boundary feature matrices next. 
Even though the examples are in 2-dimensional space, the concepts can be 
easily extended to higher dimensional spaces. In all examples, a Gaussian 
Maximum Likelihood classifier is assumed. 

Example 3.1 The mean vectors and covariance matrices of two classes are 
given as follows: 

[mean vectors $M_1$ and $M_2$ illegible in the source]

$$\Sigma_1 = \Sigma_2 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \qquad P(\omega_1) = P(\omega_2) = 0.5$$


These distributions are shown in Figure 3.7 as "ellipses of concentration." In a 
two-class, two-dimensional pattern classification problem, if the covariance 
matrices are the same, the decision boundary will be a straight line and the 
intrinsic discriminant dimension is one. This means that the vector normal to 
the decision boundary is the same at every point, N(X) = N, and the decision 
boundary feature matrix is given by 

$$\Sigma_{DBFM} = \frac{1}{K} \int_S N(X) N^t(X) p(X) \, dX = \frac{1}{K} N N^t \int_S p(X) \, dX = N N^t$$

$$\mathrm{Rank}(\Sigma_{DBFM}) = 1$$

It is noted that the rank of the decision boundary feature matrix is one which is 
equal to the intrinsic discriminant dimension and the eigenvector corresponding 
to the non-zero eigenvalue is the desired feature vector which gives the same 
classification accuracy as in the original 2-dimensional space. 
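
The equal-covariance case can be checked numerically. The following is a minimal sketch, not the author's code: the mean vectors `M1` and `M2` are hypothetical values chosen only for illustration, the covariance is the one given in Example 3.1, and the normal direction is taken from the gradient of the Gaussian ML discriminant for equal covariances.

```python
import numpy as np

# Hypothetical mean vectors (assumption, for illustration only).
M1 = np.array([1.0, -1.0])
M2 = np.array([-1.0, 1.0])
# Common covariance matrix of Example 3.1.
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

# With equal covariances the gradient of h(X) is the constant vector
# Sigma^{-1} (M2 - M1), so every point of the boundary has the same normal.
w = np.linalg.solve(Sigma, M2 - M1)
N = w / np.linalg.norm(w)

# The decision boundary feature matrix then reduces to N N^t.
dbfm = np.outer(N, N)

eigvals, eigvecs = np.linalg.eigh(dbfm)      # eigenvalues in ascending order
rank = int(np.sum(eigvals > 1e-10))          # number of non-zero eigenvalues
feature = eigvecs[:, np.argmax(eigvals)]     # the single feature vector

print("rank:", rank)
print("feature vector:", feature)
```

The eigenvector returned for the non-zero eigenvalue is the boundary normal up to sign, which is the one feature the example calls for.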

Figure 3.7 An example where the covariance matrices of two classes are 
the same and the decision boundary is a straight line. 

Example 3.2 The mean vectors and covariance matrices of two classes are 
given as follows: 



$$P(\omega_1) = P(\omega_2) = 0.5$$

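The distributions of the two classes are shown in Figure 3.8 as "ellipses of 
concentration." In this example, the decision boundary is a circle and symmetric, 
and $\frac{1}{K} p(X)$ is a constant given by $\frac{1}{2\pi r}$ where r is the radius of the circle. The 
decision boundary feature matrix will be given by 

$$\Sigma_{DBFM} = \frac{1}{2\pi r} \int_0^{2\pi} \begin{bmatrix} \cos\theta & \sin\theta \end{bmatrix}^t \begin{bmatrix} \cos\theta & \sin\theta \end{bmatrix} r \, d\theta$$

$$= \frac{1}{2\pi} \int_0^{2\pi} \begin{bmatrix} \cos\theta\cos\theta & \cos\theta\sin\theta \\ \sin\theta\cos\theta & \sin\theta\sin\theta \end{bmatrix} d\theta = \frac{1}{2} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

$$\mathrm{Rank}(\Sigma_{DBFM}) = 2$$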


From the distribution of data, it is seen that two features are needed to achieve 
the same classification accuracy as in the original space. This means that the 
intrinsic discriminant dimension is 2 in this case. It is noted that the rank of the 
decision boundary feature matrix is also 2, which is equivalent to the intrinsic 
discriminant dimension. 
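
The integral of Example 3.2 can be approximated by averaging $N N^t$ over points spaced evenly around the circle. This is a small sketch under the example's own assumption that $p(X)/K$ is constant on the boundary; the function name `circle_dbfm` and the number of sample points are illustrative choices.

```python
import numpy as np

def circle_dbfm(num_points=3600):
    """Approximate the DBFM of a circular decision boundary.

    Because p(X)/K is constant on the circle, the integral reduces to the
    plain average of N N^t over evenly spaced angles.
    """
    theta = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    normals = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # unit normals
    return normals.T @ normals / num_points

dbfm = circle_dbfm()
rank = np.linalg.matrix_rank(dbfm, tol=1e-8)
print(dbfm)
print("rank:", rank)
```

The result is one half of the identity matrix, so both eigenvalues are equal and non-zero and neither direction can be dropped.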



Figure 3.8 The decision boundary feature matrix for equal means and 
different covariances. 


In a similar way, we define an EFFECTIVE DECISION BOUNDARY 
FEATURE MATRIX. The effective decision boundary feature matrix is the same 
as the decision boundary feature matrix except that only the effective decision 
boundary instead of the entire decision boundary is considered. 


Definition 3.7 The effective decision boundary feature matrix (EDBFM): Let 
N(X) be the unit normal vector to the decision boundary at a point X on 
the effective decision boundary for a given pattern classification problem. 
Then the effective decision boundary feature matrix $\Sigma_{EDBFM}$ is defined as 

$$\Sigma_{EDBFM} = \frac{1}{K'} \int_{S'} N(X) N^t(X) p(X) \, dX$$

where p(X) is a probability density function, $K' = \int_{S'} p(X) \, dX$, S' is the 
effective decision boundary as defined in Definition 3.4, and the integral 
is performed over the effective decision boundary. 


3.5.3 Properties of Decision Boundary Feature Matrix 

In this section, some properties of the decision boundary feature matrix 
will be discussed. 

Property 3.1 The decision boundary feature matrix is a real, symmetric 
matrix. 

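Proof: It can be shown that $\Sigma_{DBFM} = (\Sigma_{DBFM})^t$ as follows: 

$$(\Sigma_{DBFM})^t = \left\{ \frac{1}{K} \int_S N(X) N^t(X) p(X) \, dX \right\}^t$$

$$= \frac{1}{K} \int_S \{N(X) N^t(X)\}^t p(X) \, dX$$

$$= \frac{1}{K} \int_S N(X) N^t(X) p(X) \, dX$$

$$= \Sigma_{DBFM}$$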

Property 3.2 The eigenvectors of the decision boundary feature matrix are 
orthogonal. 

Proof: Since the decision boundary feature matrix is a real symmetric matrix, 
eigenvectors corresponding to distinct eigenvalues are orthogonal, and an 
orthogonal set of eigenvectors can always be chosen (Cullen 1972). 

Property 3.3 The decision boundary feature matrix is positive semi-definite. 

Proof: Let N be a real column vector. Then the matrix $N N^t$ is positive 
semi-definite (Cullen 1972). Let $\lambda$ be an eigenvalue of $\Sigma_{DBFM}$ and $\phi \ne 0$ an 
associated eigenvector. Then 

$$\Sigma_{DBFM} \phi = \lambda \phi$$

and 

$$\phi^t \Sigma_{DBFM} \phi = \phi^t \lambda \phi$$

so that 

$$\lambda = \frac{\phi^t \Sigma_{DBFM} \phi}{\phi^t \phi} = \frac{1}{\phi^t \phi} \, \phi^t \left\{ \frac{1}{K} \int_S N(X) N^t(X) p(X) \, dX \right\} \phi$$

$$= \frac{1}{\phi^t \phi} \, \frac{1}{K} \int_S \phi^t N(X) N^t(X) \phi \, p(X) \, dX \ge 0$$

where $\phi^t N(X) N^t(X) \phi = (N^t(X) \phi)^2 \ge 0$ for any X, 
$p(X) \ge 0$ since p(X) is a probability density function, 
$K = \int_S p(X) \, dX > 0$, and 
$\phi^t \phi > 0$. 

Thus, the decision boundary feature matrix is positive semi-definite. 

Property 3.4 The decision boundary feature matrix of the whole decision 
boundary can be expressed as a summation of the decision boundary 
feature matrices calculated from segments of the whole decision 
boundary if the segments are mutually exclusive and exhaustive. 

Proof: Let S be the whole decision boundary. Let $S_1 \cup S_2 = S$ and $S_1 \cap S_2 = \emptyset$. 
Then 

$$\Sigma_{DBFM} = \frac{1}{K} \int_S N(X) N^t(X) p(X) \, dX$$

$$= \frac{1}{K} \int_{S_1} N(X) N^t(X) p(X) \, dX + \frac{1}{K} \int_{S_2} N(X) N^t(X) p(X) \, dX$$

$$= \Sigma_{DBFM}^{S_1} + \Sigma_{DBFM}^{S_2}$$

where $\Sigma_{DBFM}^{S_i}$ is the decision boundary feature matrix calculated from the 
segment $S_i$ of the decision boundary. 

Figure 3.9 shows an illustration. The decision boundary of Figure 3.9 is a circle. 
Let $S_1$ be the upper half of the circle and $S_2$ the lower half of the circle. Then the 
decision boundary feature matrix can be expressed as a summation of the 
decision boundary feature matrix calculated from $S_1$ and the decision boundary 
feature matrix calculated from $S_2$. 




Figure 3.9 The decision boundary feature matrix can be calculated by 
segments. 
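
Properties 3.1, 3.3 and 3.4 can be verified numerically on the circular boundary of Figure 3.9. The sketch below assumes, as in Example 3.2, that $p(X)/K$ is constant on the circle; the helper `segment_dbfm` is a hypothetical name, and each segment is normalized by the K of the whole boundary, as in the proof of Property 3.4.

```python
import numpy as np

def segment_dbfm(theta, total_points):
    """DBFM contribution of the boundary points at angles `theta`.

    The sum is divided by the number of points on the WHOLE boundary, which
    plays the role of the common normalizing constant K.
    """
    normals = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return normals.T @ normals / total_points

n = 2000
theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
upper = theta[theta < np.pi]    # S_1: upper half of the circle
lower = theta[theta >= np.pi]   # S_2: lower half of the circle

whole = segment_dbfm(theta, n)
s1 = segment_dbfm(upper, n)
s2 = segment_dbfm(lower, n)

print("symmetric:", np.allclose(whole, whole.T))                  # Property 3.1
print("eigenvalues >= 0:", np.linalg.eigvalsh(whole) >= -1e-12)   # Property 3.3
print("additive:", np.allclose(whole, s1 + s2))                   # Property 3.4
```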


From Property 3.4, we can calculate the decision boundary feature matrix 
of a multiclass problem by summing up the decision boundary feature matrices 
of each pair of classes. Figure 3.10 shows an example. The decision boundary 
feature matrix of the 3-class problem can be calculated as follows: 

$$\Sigma_{DBFM} = \frac{1}{K} \int_S N(X) N^t(X) p(X) \, dX$$

$$= \frac{1}{K} \int_{S_{12}} N(X) N^t(X) p(X) \, dX + \frac{1}{K} \int_{S_{13}} N(X) N^t(X) p(X) \, dX + \frac{1}{K} \int_{S_{23}} N(X) N^t(X) p(X) \, dX$$

$$= \Sigma_{DBFM}^{S_{12}} + \Sigma_{DBFM}^{S_{13}} + \Sigma_{DBFM}^{S_{23}}$$



Figure 3.10 The decision boundary feature matrix of a multiclass problem can 
be calculated from the decision boundary feature matrices from 
each pair of classes. 


3.5.4 Decision Boundary Feature Matrix for Finding the Intrinsic Discriminant 
Dimension and Feature Vectors 

From the way the decision boundary feature matrix is defined and from 
the examples, one might suspect that the rank of the decision boundary feature 
matrix will be the intrinsic discriminant dimension, and the eigenvectors of the 
decision boundary feature matrix of a pattern recognition problem 
corresponding to non-zero eigenvalues are the required feature vectors to 
achieve the same classification accuracy as in the original space. In this regard 
we state the following two theorems which are useful in predicting the intrinsic 
discriminant dimension of a pattern classification problem and finding the 
feature vectors. 

Theorem 3.2 The rank of the decision boundary feature matrix $\Sigma_{DBFM}$ 
(Definition 3.6) of a pattern classification problem will be the intrinsic 
discriminant dimension (Definition 3.5) of the pattern classification 
problem. 

Proof: Let X be an observation in the N-dimensional Euclidean space $E^N$ under 
the hypothesis $H_i$: $X \in \omega_i$, i = 1,...,J, where J is the number of classes. Let $\Sigma_{DBFM}$ 
be the decision boundary feature matrix as defined in Definition 3.6. Suppose 
that 

$$\mathrm{rank}(\Sigma_{DBFM}) = M \le N.$$

Let $\{\phi_1, \ldots, \phi_M\}$ be the eigenvectors of $\Sigma_{DBFM}$ corresponding to non-zero 
eigenvalues. Then a vector normal to the decision boundary at any point on the 
decision boundary can be represented by a linear combination of $\phi_i$, i=1,..,M. In 
other words, for any normal vector $V_N$ to the decision boundary 

$$V_N = \sum_{i=1}^{M} a_i \phi_i$$

Since any linearly independent set of vectors from a finite dimensional vector 
space can be extended to a basis for the vector space, we can expand $\{\phi_1, \ldots, \phi_M\}$ 
to form a basis for the N-dimensional Euclidean space. Let $\{\phi_1, \phi_2, \ldots, \phi_M, \phi_{M+1}, \ldots, \phi_N\}$ 
be such a basis. Without loss of generality, we can assume that 
$\{\phi_1, \ldots, \phi_M, \phi_{M+1}, \ldots, \phi_N\}$ is an orthonormal basis. One can always find an orthonormal basis 
for a vector space using the Gram-Schmidt procedure (Cullen 1972). Since the 
basis is assumed to be orthonormal, it can be easily seen that the vectors 
$\{\phi_{M+1}, \phi_{M+2}, \ldots, \phi_N\}$ are orthogonal to any vector $V_N$ normal to the decision boundary. 
This is because for i = M+1,..,N 

$$\phi_i^t V_N = \phi_i^t \sum_{k=1}^{M} a_k \phi_k = \sum_{k=1}^{M} a_k \phi_i^t \phi_k = 0 \quad \text{since } \phi_i^t \phi_k = 0 \text{ if } i \ne k$$

Therefore, since the vectors $\{\phi_{M+1}, \phi_{M+2}, \ldots, \phi_N\}$ are orthogonal to any vector 
normal to the decision boundary, according to Lemma 3.1, the vectors 
$\{\phi_{M+1}, \phi_{M+2}, \ldots, \phi_N\}$ are discriminantly redundant. Therefore the number of discriminantly 
redundant features is N - M, and the intrinsic discriminant dimension is M which 
is the rank of the decision boundary feature matrix $\Sigma_{DBFM}$. 

Q.E.D. 

It is noted that we did not make any assumption on the number of classes 
in proving Theorem 3.2. In other words, Theorem 3.2 holds for any number of 
classes. From Theorem 3.2, we can derive the following theorem which is useful 
to find the feature vectors needed to achieve the same classification accuracy 
as in the original space. 

Theorem 3.3 The eigenvectors of the decision boundary feature matrix of a 
pattern recognition problem corresponding to non-zero eigenvalues are 
the feature vectors needed to achieve the same classification accuracy 
as in the original space for the pattern recognition problem. 

Proof: In the proof of Theorem 3.2, it was shown that the eigenvectors of $\Sigma_{DBFM}$ 
corresponding to non-zero eigenvalues are the only discriminantly informative 
feature vectors. Thus by retaining the eigenvectors of $\Sigma_{DBFM}$ corresponding to 
non-zero eigenvalues, it is possible to achieve the same classification accuracy 
as in the original space. 

Q.E.D. 
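
Theorem 3.3 can be illustrated on a two-class problem with equal covariances, where the decision boundary feature matrix has rank one. The sketch below is an assumption-laden illustration rather than the author's experiment: the means, covariance, sample size and random seed are hypothetical, and the threshold is taken as t = 0. Each sample is projected onto the eigenvectors with non-zero eigenvalues, and the resulting decisions are compared with those made in the original space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-class problem with equal covariances (assumption).
M1 = np.array([1.0, -1.0])
M2 = np.array([-1.0, 1.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
Sinv = np.linalg.inv(Sigma)

def h(X):
    """h(X) = -ln P(X|w1)/P(X|w2) for equal covariances; X has shape (n, 2)."""
    d1, d2 = X - M1, X - M2
    q1 = np.einsum("ij,jk,ik->i", d1, Sinv, d1)
    q2 = np.einsum("ij,jk,ik->i", d2, Sinv, d2)
    return 0.5 * q1 - 0.5 * q2

# The boundary is a line with a constant normal, so DBFM = N N^t.
N = Sinv @ (M2 - M1)
N = N / np.linalg.norm(N)
dbfm = np.outer(N, N)

# Keep only the eigenvectors with non-zero eigenvalues (Theorem 3.3).
eigvals, eigvecs = np.linalg.eigh(dbfm)
Phi = eigvecs[:, eigvals > 1e-10]            # shape (2, 1)

X = np.vstack([rng.multivariate_normal(M1, Sigma, 300),
               rng.multivariate_normal(M2, Sigma, 300)])
X_hat = X @ Phi @ Phi.T                      # observation rebuilt from one feature

labels_full = h(X) > 0.0                     # decisions in the original space
labels_reduced = h(X_hat) > 0.0              # decisions using the retained feature
print("identical decisions:", np.array_equal(labels_full, labels_reduced))
```

Because h is linear in X along the retained direction only, discarding the redundant direction leaves h unchanged up to rounding error.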


3.5.5 Procedure to Find the Decision Boundary Feature Matrix 

Assuming a Gaussian ML classifier is used, the decision boundary will 
be a quadratic surface if the covariance matrices are different. In this case, the 
rank of the decision boundary feature matrix will be the same as the dimension 
of the original space except for some special cases. However, in practice, only a 
small portion of the decision boundary is significant. Therefore if the decision 
boundary feature matrix is estimated using only the significant portion of the 
decision boundary or the effective decision boundary, the rank of the decision 
boundary feature matrix, equivalently the number of features, can be reduced 
substantially while achieving about the same classification accuracy. 

More specifically, the significance of any portion of the decision boundary 
is related to how much accuracy can be achieved by utilizing that portion of the 
decision boundary. Consider the case of Figure 3.11 which shows the two 
regions which contain 99.9% of each Gaussianly distributed class, along with 
the decision boundary and the effective decision boundary of 99.9%. Although 
in this example the threshold probability, $P_{threshold}$, is set to 99.9% arbitrarily, it 
can be set to any value depending on the application (See Definition 3.4). If 
only the effective decision boundary, which is displayed in bold, is retained, it is 
still possible to classify 99.9% of data from class $\omega_1$ the same as if the whole 
decision boundary had been used, since the effective decision boundary 
together with the boundary of the region which contains 99.9% of class $\omega_1$ can 
divide the data of class $\omega_1$ into two groups in the same manner as if the whole 
decision boundary is used; less than 0.1% of data from class $\omega_1$ may be 
classified differently. 

Therefore, for the case of Figure 3.11, the effective decision boundary 
displayed as a bold line plays a significant role in discriminating between the 
classes, while the part of the decision boundary displayed as a non-bold line 
does not contribute much in discriminating between the classes. On the other 
hand, other portions of the decision boundary, displayed as a dotted line, would 
be very rarely used. 

It is noted, however, that even though only the effective decision 
boundary is used for feature extraction, this does not mean that the portion 
outside of the effective regions does not have a decision boundary. The actual 
decision boundary is approximated by the extension of the effective decision 
boundary as shown in Figure 3.11. As shall be seen, feature extraction based 
on the effective decision boundary instead of the complete decision boundary 
will result in fewer features while achieving nearly the same classification 
accuracy. 





New decision boundary represented by 
the effective decision boundary outside 
the effective regions 


Figure 3.11 An example of a decision boundary and an effective 
decision boundary. 


Next we propose a procedure for calculating the effective decision 
boundary feature matrix numerically. 


Numerical Procedure to Find the Effective Decision Boundary Feature Matrix 

(2 pattern classes) 

1. Let $\hat{M}_i$ and $\hat{\Sigma}_i$ be the estimated mean and covariance of class $\omega_i$. Classify 
the training samples using full dimensionality. Apply a chi-square threshold 
test to the correctly classified training samples of each class and delete 
outliers. In other words, for class $\omega_i$, retain X only if 

$$(X - \hat{M}_i)^t \hat{\Sigma}_i^{-1} (X - \hat{M}_i) < R_{t1}$$

In the following STEPS, only correctly classified training samples which 
passed the chi-square threshold test will be used. Let $\{X_1, X_2, \ldots, X_{L_1}\}$ be such 
training samples of class $\omega_1$ and $\{Y_1, Y_2, \ldots, Y_{L_2}\}$ be such training samples of 
class $\omega_2$. 

2. Apply a chi-square threshold test of class $\omega_1$ to the samples of class $\omega_2$ and 
retain $Y_j$ only if 

$$(Y_j - \hat{M}_1)^t \hat{\Sigma}_1^{-1} (Y_j - \hat{M}_1) < R_{t2}$$

If the number of the samples of class $\omega_2$ which pass the chi-square threshold 
test is less than $L_{min}$ (see below), retain the $L_{min}$ samples of class $\omega_2$ which 
give the smallest values. 

3. For $X_i$ of class $\omega_1$, find the nearest sample of class $\omega_2$ retained in STEP 2. 

4. Find the point $P_i$ where the straight line connecting the pair of samples found 
in STEP 3 meets the decision boundary. 

5. Find the unit normal vector, $N_i$, to the decision boundary at the point $P_i$ found 
in STEP 4. 

6. By repeating STEP 3 through STEP 5 for $X_i$, i=1,..,$L_1$, $L_1$ unit normal vectors 
will be calculated. From the normal vectors, calculate an estimate of the 
effective decision boundary feature matrix ($\Sigma_{EDBFM}^{1}$) from class $\omega_1$ as follows: 

$$\Sigma_{EDBFM}^{1} = \frac{1}{L_1} \sum_{i=1}^{L_1} N_i N_i^t$$

Repeat STEP 2 through STEP 6 for class $\omega_2$. 

7. Calculate an estimate of the final effective decision boundary feature matrix 
as follows: 

$$\Sigma_{EDBFM} = \frac{1}{2} \left( \Sigma_{EDBFM}^{1} + \Sigma_{EDBFM}^{2} \right)$$

The chi-square threshold test in STEP 1 is necessary to eliminate 
outliers. Otherwise, outliers may give a false decision boundary when classes 
are well separable. The chi-square threshold test to the other class in STEP 2 is 
necessary to concentrate on the effective decision boundary (Definition 3.4). 
Otherwise, the decision boundary feature matrix may be calculated from an 
insignificant portion of the decision boundary, resulting in ineffective features. In the 
experiments, $L_{min}$ in STEP 2 is set to 5 and $R_{t1}$ is chosen such that 

$$\Pr\{X \mid (X - \hat{M}_i)^t \hat{\Sigma}_i^{-1} (X - \hat{M}_i) < R_{t1}\} = 0.95, \quad i = 1, 2, \quad \text{and } R_{t1} = R_{t2}$$

The threshold probability is taken as 0.95. In an ideal case assuming a 
Gaussian distribution, the threshold probability can be larger, e.g., 0.999. 
However, for real data, if the threshold probability is set too large, some outliers 
could be included, causing some inefficiency in calculating the decision 
boundary feature matrix. 
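
The seven STEPS above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the author's implementation: "nearest" is taken to mean Euclidean distance, the decision rule assigns X to class 1 when h(X) < t with t = 0, the boundary point of STEP 4 is located by bisection along the connecting line instead of a closed-form expression, and the test data (means, covariance, seed) are hypothetical. For two-dimensional Gaussian data the 0.95 chi-square threshold is $R_t = -2 \ln 0.05 \approx 5.99$.

```python
import numpy as np

def maha_sq(X, mean, cov_inv):
    """Squared Mahalanobis distance of each row of X from `mean`."""
    d = X - mean
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

def make_discriminant(m1, c1, m2, c2):
    """Return h(X) = -ln P(X|w1)/P(X|w2) and its gradient for Gaussian classes."""
    c1i, c2i = np.linalg.inv(c1), np.linalg.inv(c2)
    logdet = 0.5 * (np.linalg.slogdet(c1)[1] - np.linalg.slogdet(c2)[1])

    def h(X):
        X = np.atleast_2d(X)
        return 0.5 * maha_sq(X, m1, c1i) - 0.5 * maha_sq(X, m2, c2i) + logdet

    def grad(x):
        return c1i @ (x - m1) - c2i @ (x - m2)

    return h, grad

def crossing_point(h, p1, p2, t, iters=60):
    """STEP 4 by bisection; assumes p1 and p2 lie on opposite sides of h(X) = t."""
    lo, hi = 0.0, 1.0
    f_lo = h(p1)[0] - t
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = h(p1 + mid * (p2 - p1))[0] - t
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return p1 + 0.5 * (lo + hi) * (p2 - p1)

def one_sided_edbfm(Xa, Xb, mean_a, cov_a, h, grad, r_t, t, l_min):
    """STEPS 2-6 for one class: average N N^t over the retained samples Xa."""
    d = maha_sq(Xb, mean_a, np.linalg.inv(cov_a))           # STEP 2
    keep = d < r_t
    if keep.sum() < l_min:                                  # fall back to L_min nearest
        keep = np.zeros(len(Xb), dtype=bool)
        keep[np.argsort(d)[:l_min]] = True
    Yb = Xb[keep]
    acc = np.zeros((Xa.shape[1], Xa.shape[1]))
    for x in Xa:
        y = Yb[np.argmin(np.sum((Yb - x) ** 2, axis=1))]    # STEP 3 (Euclidean)
        n = grad(crossing_point(h, x, y, t))                # STEPS 4 and 5
        n = n / np.linalg.norm(n)
        acc += np.outer(n, n)
    return acc / len(Xa)                                    # STEP 6

def estimate_edbfm(X1, X2, r_t, t=0.0, l_min=5):
    m1, c1 = X1.mean(axis=0), np.cov(X1, rowvar=False)
    m2, c2 = X2.mean(axis=0), np.cov(X2, rowvar=False)
    h, grad = make_discriminant(m1, c1, m2, c2)
    # STEP 1: keep correctly classified samples that pass the chi-square test.
    A = X1[(h(X1) < t) & (maha_sq(X1, m1, np.linalg.inv(c1)) < r_t)]
    B = X2[(h(X2) > t) & (maha_sq(X2, m2, np.linalg.inv(c2)) < r_t)]
    e1 = one_sided_edbfm(A, B, m1, c1, h, grad, r_t, t, l_min)
    e2 = one_sided_edbfm(B, A, m2, c2, h, grad, r_t, t, l_min)
    return 0.5 * (e1 + e2)                                  # STEP 7

# Hypothetical two-class data with a common covariance (assumption).
rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
X1 = rng.multivariate_normal([1.0, -1.0], Sigma, 300)
X2 = rng.multivariate_normal([-1.0, 1.0], Sigma, 300)
est = estimate_edbfm(X1, X2, r_t=-2.0 * np.log(0.05))
eigvals = np.linalg.eigvalsh(est)
print("eigenvalues (ascending):", eigvals)
```

Since every $N_i N_i^t$ has unit trace, the eigenvalues of the estimate sum to one and can be read directly as proportions.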

Figure 3.12 shows an illustration of the proposed procedure. For each 
sample, the nearest sample classified as the other class is found and the two 
samples are connected by a straight line. Then a vector normal to the decision 
boundary is found at the point where the straight line connecting the two points 
meets the decision boundary. From these normal vectors, $\Sigma_{EDBFM}$ is estimated. 



Figure 3.12 Illustration of the procedure to find the effective decision 
boundary feature matrix numerically. 


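If we assume a Gaussian distribution for each class and the Gaussian ML 
classifier is used, h(X) in equation (3.1) is given by 

$$h(X) = -\ln \frac{P(X|\omega_1)}{P(X|\omega_2)} = \ln P(X|\omega_2) - \ln P(X|\omega_1)$$

$$= \frac{1}{2} (X - M_1)^t \Sigma_1^{-1} (X - M_1) + \frac{1}{2} \ln|\Sigma_1| - \frac{1}{2} (X - M_2)^t \Sigma_2^{-1} (X - M_2) - \frac{1}{2} \ln|\Sigma_2|$$

The vector normal to the decision boundary at $X_0$ is given by (See Appendix A) 

$$N = \nabla h(X)|_{X = X_0} = (\Sigma_1^{-1} - \Sigma_2^{-1}) X_0 + (\Sigma_2^{-1} M_2 - \Sigma_1^{-1} M_1) \qquad (3.3)$$

If $P_1$ and $P_2$ are on different sides of the decision boundary h(X) = t, 
assuming that the Gaussian ML classifier is used, the point $X_0$ where the line 
connecting $P_1$ and $P_2$ passes through the decision boundary is given by 
(Appendix A) 

$$X_0 = uV + V_0 \qquad (3.4)$$

where 

$$V_0 = P_1, \qquad V = P_2 - P_1,$$

$$u = \frac{t - c'}{b} \quad \text{if } a = 0,$$

$$u = \frac{-b \pm \sqrt{b^2 - 4a(c' - t)}}{2a} \text{ and } 0 \le u \le 1 \quad \text{if } a \ne 0,$$

$$a = \frac{1}{2} V^t (\Sigma_1^{-1} - \Sigma_2^{-1}) V,$$

$$b = V_0^t (\Sigma_1^{-1} - \Sigma_2^{-1}) V - (M_1^t \Sigma_1^{-1} - M_2^t \Sigma_2^{-1}) V,$$

$$c' = \frac{1}{2} V_0^t (\Sigma_1^{-1} - \Sigma_2^{-1}) V_0 - (M_1^t \Sigma_1^{-1} - M_2^t \Sigma_2^{-1}) V_0 + c,$$

$$c = \frac{1}{2} (M_1^t \Sigma_1^{-1} M_1 - M_2^t \Sigma_2^{-1} M_2) + \frac{1}{2} \ln \frac{|\Sigma_1|}{|\Sigma_2|}$$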

Equation (3.4) can be used to calculate the point on the decision boundary from 
two samples classified differently and equation (3.3) can be used to calculate a 
normal vector to the decision boundary. 
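
Equations (3.3) and (3.4) translate directly into a short routine. The sketch below is illustrative only: the function name `boundary_crossing` and the test statistics (zero means, covariances diag(3, 3, 1) and the identity, threshold t = 0) are assumptions chosen so that the boundary is a cylinder whose radius is known in closed form.

```python
import numpy as np

def boundary_crossing(p1, p2, m1, c1, m2, c2, t=0.0):
    """Point X0 where the line p1 -> p2 meets h(X) = t, and the unit normal there.

    Follows equations (3.3) and (3.4); p1 and p2 are assumed to lie on
    opposite sides of the decision boundary.
    """
    c1i, c2i = np.linalg.inv(c1), np.linalg.inv(c2)
    A = c1i - c2i                                  # Sigma1^{-1} - Sigma2^{-1}
    B = m1 @ c1i - m2 @ c2i                        # M1^t Sigma1^{-1} - M2^t Sigma2^{-1}
    c = (0.5 * (m1 @ c1i @ m1 - m2 @ c2i @ m2)
         + 0.5 * (np.linalg.slogdet(c1)[1] - np.linalg.slogdet(c2)[1]))
    v0, v = p1, p2 - p1
    a = 0.5 * v @ A @ v
    b = v0 @ A @ v - B @ v
    c_prime = 0.5 * v0 @ A @ v0 - B @ v0 + c
    if np.isclose(a, 0.0):
        roots = [(t - c_prime) / b]
    else:
        disc = np.sqrt(b * b - 4.0 * a * (c_prime - t))
        roots = [(-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)]
    u = next(r for r in roots if 0.0 <= r <= 1.0)  # root lying on the segment
    x0 = v0 + u * v                                # equation (3.4)
    n = A @ x0 + (c2i @ m2 - c1i @ m1)             # equation (3.3)
    return x0, n / np.linalg.norm(n)

# Hypothetical check: zero means, Sigma1 = diag(3, 3, 1), Sigma2 = I.
# Setting h(X) = 0 gives the cylinder x1^2 + x2^2 = 3 ln 3.
m = np.zeros(3)
x0, normal = boundary_crossing(np.array([0.0, 0.0, 0.3]),
                               np.array([3.0, 0.0, 0.3]),
                               m, np.diag([3.0, 3.0, 1.0]), m, np.eye(3))
print("crossing point:", x0)
print("unit normal:", normal)
```

The unit normal has no component along the third axis, which is the direction in which the two classes have identical statistics.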


3.5.6 Decision Boundary Feature Matrix for Multiclass Problem 

If there are more than two classes, the total decision boundary feature 
matrix can be defined as the sum of the decision boundary feature matrices of 
each pair of classes. If prior probabilities are available, the summation can be 
weighted. In other words, if there are M classes, the total decision boundary 
feature matrix can be defined as 

$$\Sigma_{DBFM} = \sum_{i=1}^{M} \sum_{j=1, j \ne i}^{M} P(\omega_i) P(\omega_j) \Sigma_{DBFM}^{ij} \qquad (3.5)$$

where $\Sigma_{DBFM}^{ij}$ is the decision boundary feature matrix between 
class $\omega_i$ and class $\omega_j$ and $P(\omega_i)$ is the prior probability of class $\omega_i$ if 
available. Otherwise let $P(\omega_i) = 1/M$. 

It is noted that Theorem 3.2 and Theorem 3.3 still hold for the multiclass case 
and the eigenvectors of the total decision boundary feature matrix 
corresponding to non-zero eigenvalues are the necessary feature vectors to 
achieve the same classification accuracy as in the original space. In practice, 
the total effective decision boundary feature matrix can be calculated by 
repeating the procedure for each pair of classes. 
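
The weighted sum of equation (3.5) can be sketched as below. The function `total_dbfm`, the dictionary layout keyed by unordered class pairs, and the three pairwise matrices are hypothetical; the double sum is taken over all ordered pairs with i different from j, so each unordered pair contributes twice.

```python
import numpy as np

def total_dbfm(pairwise, num_classes, priors=None):
    """Total DBFM of equation (3.5).

    pairwise: dict mapping (i, j) with i < j to the DBFM between classes i and j.
    priors:   prior probabilities; equal priors 1/M are used if not given.
    """
    if priors is None:
        priors = np.full(num_classes, 1.0 / num_classes)
    dim = next(iter(pairwise.values())).shape[0]
    total = np.zeros((dim, dim))
    for i in range(num_classes):
        for j in range(num_classes):
            if i != j:
                total += priors[i] * priors[j] * pairwise[(min(i, j), max(i, j))]
    return total

# Hypothetical 3-class problem in 2-D with straight-line pairwise boundaries.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
d = (e1 + e2) / np.sqrt(2.0)
pairwise = {(0, 1): np.outer(e1, e1),
            (0, 2): np.outer(e2, e2),
            (1, 2): np.outer(d, d)}
total = total_dbfm(pairwise, num_classes=3)
print(total)
print("rank:", np.linalg.matrix_rank(total))
```

Although each pairwise matrix has rank one, the total has rank two, so two features are needed once all three classes are considered together.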


3.5.7 Eliminating Redundancy in Multiclass Problem 

The total decision boundary feature matrix defined in equation (3.5) can 
be made more efficient. Consider the following example situation. Suppose 
Table 3.1 shows eigenvalues for the 2 pattern class problem of Table 3.6. Table 
3.1 also shows proportions of the eigenvalues, classification accuracies, and 
normalized classification accuracies obtained by dividing the classification 
accuracies by the classification accuracy obtained using all features. With just 
one feature, the classification accuracy is 93.4% which is 97.9% of the 
classification accuracy obtained using all features. Thus, in this 2 class problem, 
if this level of accuracy is deemed adequate, just one feature needs to be 
included in calculating the total decision boundary feature matrix. The other 19 
features contribute little to improving classification accuracy and can be 
eliminated in calculating the total decision boundary feature matrix. In addition, 
feature vectors from other pairs of classes will improve the classification 

accuracy. 

Table 3.1 Eigenvalues and classification accuracies of the 2 class problem. 

No.   Eigenvalue   Proportion of     Classification   Normalized Classification
                   Eigenvalues (%)   Accuracy (%)     Accuracy (%)
 1    0.994        49.6              93.4              97.9
 2    0.547        27.3              94.3              98.8
 3    0.167         8.3              94.4              99.0
 4    0.133         6.6              95.0              99.6
 5    0.066         3.3              95.1              99.7
 6    0.041         2.1              94.9              99.5
 7    0.020         1.0              94.9              99.5
 8    0.012         0.6              94.8              99.4
 9    0.008         0.4              95.0              99.6
10    0.007         0.3              95.3              99.9
11    0.005         0.2              95.3              99.9
12    0.001         0.1              95.7             100.3
13    0.001         0.0              95.5             100.1
14    0.001         0.0              95.4             100.0
15    0.000         0.0              95.3              99.9
16    0.000         0.0              95.6             100.2
17    0.000         0.0              95.5             100.1
18    0.000         0.0              95.5             100.1
19    0.000         0.0              95.4             100.0
20    0.000         0.0              95.4             100.0


To eliminate such redundancy in multiclass problems, we define the 
decision boundary feature matrix of $P_t$ ($\Sigma_{DBFM}(P_t)$) as follows: 

Definition 3.8 Let $L_t$ be the number of eigenvectors corresponding to the largest 
eigenvalues needed to obtain $P_t$ of the classification accuracy obtained 
using all features. Then the decision boundary feature matrix of $P_t$ 
($\Sigma_{DBFM}(P_t)$) is defined as 

$$\Sigma_{DBFM}(P_t) = \sum_{i=1}^{L_t} \lambda_i \phi_i \phi_i^t$$

where $\lambda_i$ and $\phi_i$ are eigenvalues and eigenvectors of the decision 
boundary feature matrix. 

The total decision boundary feature matrix of $P_t$ in a multiclass problem can be 
defined as 

$$\Sigma_{DBFM}(P_t) = \sum_{i=1}^{M} \sum_{j=1, j \ne i}^{M} P(\omega_i) P(\omega_j) \Sigma_{DBFM}^{ij}(P_t)$$

where $\Sigma_{DBFM}^{ij}(P_t)$ is the decision boundary feature matrix of $P_t$ 
between class $\omega_i$ and class $\omega_j$ and $P(\omega_i)$ is the prior probability of 
class $\omega_i$ if available. Otherwise let $P(\omega_i) = 1/M$. 

From Definition 3.8, we can calculate the decision boundary feature matrix of 
0.95 of Table 3.1 as follows: 

The classification accuracy using full dimensionality (assume it is 20) is 
95.4%. The number of features needed to achieve a classification accuracy of 
90.6% (= 95.4 × 0.95) is 1. Therefore, the decision boundary feature matrix of 
0.95 of Table 3.1 is given by 

$$\Sigma_{DBFM}(0.95) = \sum_{i=1}^{1} \lambda_i \phi_i \phi_i^t = \lambda_1 \phi_1 \phi_1^t$$

where the $\lambda_i$ are eigenvalues of $\Sigma_{DBFM}$ sorted in descending order and the $\phi_i$ are 
the corresponding eigenvectors. 
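
Definition 3.8 can be sketched in two small functions: one that finds $L_t$ from a list of accuracies, and one that rebuilds the matrix from the $L_t$ largest eigenpairs. The function names are hypothetical, the accuracies are the values of Table 3.1, and the 3 × 3 diagonal matrix used in the example is only a stand-in built from the three largest eigenvalues of that table.

```python
import numpy as np

def num_features_for(accuracies, p_t):
    """Smallest L such that the accuracy with L features reaches
    p_t times the accuracy obtained with all features."""
    acc = np.asarray(accuracies, dtype=float)
    target = p_t * acc[-1]
    return int(np.argmax(acc >= target)) + 1

def dbfm_of_pt(dbfm, l_t):
    """Sum of lambda_i phi_i phi_i^t over the l_t largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(dbfm)           # ascending order
    order = np.argsort(eigvals)[::-1][:l_t]           # indices of the largest l_t
    return (eigvecs[:, order] * eigvals[order]) @ eigvecs[:, order].T

# Classification accuracies (%) of Table 3.1 for 1..20 features.
accuracies = [93.4, 94.3, 94.4, 95.0, 95.1, 94.9, 94.9, 94.8, 95.0, 95.3,
              95.3, 95.7, 95.5, 95.4, 95.3, 95.6, 95.5, 95.5, 95.4, 95.4]
l_t = num_features_for(accuracies, 0.95)
print("L_t for P_t = 0.95:", l_t)

# Stand-in matrix whose eigenvalues are the three largest ones of Table 3.1.
stand_in = np.diag([0.994, 0.547, 0.167])
truncated = dbfm_of_pt(stand_in, l_t)
print(truncated)
```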


Figure 3.13 shows a performance comparison for various values of $P_t$. By 
eliminating feature vectors which contribute little to improvement of the 
classification accuracy, it is possible to improve classification accuracy by up to 
1.5% in this example. The experiment showed that $P_t$ between 0.95 and 0.97 would 
be reasonable. 

[Figure 3.13 legend entries recovered: Decision Boundary Feature Extraction $P_t$ = 0.99; Decision Boundary Feature Extraction $P_t$ = 0.999]

Figure 3.13 Performance comparison for various values of $P_t$. 


3.6 Experiments and Results 

3.6.1 An experiment with generated data 

To evaluate closely how the proposed algorithm performs under various 
circumstances, tests are conducted on data generated with given statistics 
assuming Gaussian distributions. In all examples, a Gaussian ML classifier is 
used and the same data are used for training and test. In each example, the 
Foley & Sammon method (Foley and Sammon 1975) and the Fukunaga & 
Koontz method (Fukunaga and Koontz 1970) are discussed. In particular, 
classification accuracies of the decision boundary feature extraction method 
and the Foley & Sammon method are compared. 


Example 3.3 In this example, data are generated for the following statistics. 

[mean vectors $M_1$ and $M_2$ illegible in the source]

$$\Sigma_1 = \Sigma_2 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \qquad P(\omega_1) = P(\omega_2) = 0.5$$


300 samples are generated for each class and all samples are used for training 
and test. Since the covariance matrices are the same, it can be easily seen that 
the decision boundary will be a straight line and just one feature is needed to 
achieve the same classification accuracy as in the original space. The 
eigenvalues $\lambda_i$ and the eigenvectors $\phi_i$ of $\Sigma_{EDBFM}$ are calculated as follows: 

$$\lambda_1 = 0.99995, \quad \lambda_2 = 0.00005, \qquad \phi_1 = \begin{bmatrix} 0.71 \\ -0.70 \end{bmatrix}, \quad \phi_2 = \begin{bmatrix} 0.70 \\ 0.71 \end{bmatrix}$$

Since one eigenvalue is significantly larger than the other, it can be said that 
the rank of $\Sigma_{EDBFM}$ is 1. That means only one feature is needed to achieve the 
same classification accuracy as in the original space. Considering the statistics 
of the two classes, the rank of $\Sigma_{EDBFM}$ gives the correct number of features to 
achieve the same classification accuracy as in the original space. Figure 3.14 
shows the distribution of the generated data and the decision boundary found 
by the proposed procedure. Since class mean differences are dominant in this 
example, the Foley & Sammon method will also work well. However, the 
Fukunaga & Koontz method will fail to find the correct feature vector. Table 3.2 
shows the classification accuracies of Decision Boundary Feature Extraction 
and the Foley & Sammon method. With two features, the classification accuracy is 
95.8% and both methods achieve the same accuracy with just one feature. 

Table 3.2 Classification accuracies of Decision Boundary Feature Extraction 
and the Foley & Sammon method of Example 3.3. 

No. Features   Decision Boundary          Foley & Sammon
               Feature Extraction (%)     Method (%)
1              95.8                       95.8
2              95.8                       95.8

Figure 3.14 The distribution of data for the two classes in Example 3.3. The 
decision boundary, found by the proposed algorithm, is also 
shown. (Horizontal axis: Feature 1. Legend: Class 1, Class 2, decision 
boundary found by the procedure.) 


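Example 3.4 In this example, data are generated with the following statistics. 

$$M_1 = \begin{bmatrix} 0.01 \\ 0 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix}, \quad M_2 = \begin{bmatrix} -0.01 \\ 0 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}$$

$$P(\omega_1) = P(\omega_2) = 0.5$$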


300 samples are generated for each class and all samples are used for training 
and test. In this case, there is almost no difference in the mean vectors and 
there is no correlation between the features for each class. The variance of 
feature 1 of class $\omega_1$ is equal to that of class $\omega_2$ while the variance of feature 2 of 
class $\omega_1$ is larger than that of class $\omega_2$. Thus the decision boundary will consist of 
hyperbolas, and two features are needed to achieve the same classification 
accuracy as in the original space. However, the effective decision boundary 
could be approximated by a straight line without introducing significant error. 
Figure 3.15 shows the distribution of the generated data and the decision 
boundary obtained by the proposed procedure. The eigenvalues $\lambda_i$ and the 
eigenvectors $\phi_i$ of $\Sigma_{EDBFM}$ are calculated as follows. 

$$\lambda_1 = 0.92421, \quad \lambda_2 = 0.07579, \qquad \phi_1 = \begin{bmatrix} 0.06 \\ 1.00 \end{bmatrix}, \quad \phi_2 = \begin{bmatrix} -1.00 \\ 0.06 \end{bmatrix}$$

Since the rank of $\Sigma_{EDBFM}$ is 2, two features are required to achieve the same 
classification accuracy as in the original space. However, $\lambda_2$ is considerably 
smaller than $\lambda_1$, even though $\lambda_2$ is not negligible. Therefore, nearly the same 
classification accuracy could be achieved with just one feature. 


Since there is a very small difference in the mean vectors in this example, the 
Foley & Sammon method will fail to find the correct feature vector. On the other 
hand, the Fukunaga & Koontz method will find the correct feature vector. Table 
3.3 shows classification accuracies. Decision Boundary Feature Extraction 
achieves the same accuracy with one feature as can be obtained with two 
features while the Foley & Sammon method fails to find the right feature in this 
example. 

Figure 3.15 Distribution of data from the two classes in Example 3.4. The 
decision boundary found by the proposed algorithm is also 
shown. (Legend: Class 1, Class 2, decision boundary found by the procedure.) 

Table 3.3 Classification accuracies of Decision Boundary Feature Extraction 
and the Foley & Sammon method of Example 3.4. 

No. Features   Decision Boundary          Foley & Sammon
               Feature Extraction (%)     Method (%)
1              61.0                       52.5
2              61.0                       61.0


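Example 3.5 In this example, we generate data for the following statistics. 

$$M_1 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad M_2 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

$$P(\omega_1) = P(\omega_2) = 0.5$$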


200 samples are generated for each class and all samples are used for training 
and test. In this case, there is no difference in the mean vectors and there are 
variance differences in only two features. It can be seen that the decision 
boundary will be a right circular cylindrical surface of infinite height and just two 
features are needed to achieve the same classification accuracy as in the 
original space. Eigenvalues $\lambda_i$ and eigenvectors $\phi_i$ of $\Sigma_{EDBFM}$ are calculated as 
follows: 

$$\lambda_1 = 0.57581, \quad \lambda_2 = 0.42032, \quad \lambda_3 = 0.00387$$

$$\phi_1 = \begin{bmatrix} 0.86 \\ -0.50 \\ 0.01 \end{bmatrix}, \quad \phi_2 = \begin{bmatrix} 0.49 \\ 0.84 \\ 0.21 \end{bmatrix}, \quad \phi_3 = \begin{bmatrix} -0.21 \\ -0.18 \\ 0.98 \end{bmatrix}$$

$$\mathrm{Rank}(\Sigma_{EDBFM}) \approx 2$$

Since the rank of $\Sigma_{EDBFM}$ is roughly 2, it can be said that two features are 
required to achieve the same classification accuracy as in the original space, 
which agrees with the data. Since there is no difference in the mean vectors in 
this example, the Foley & Sammon method will fail to find the correct feature 
vectors. On the other hand, the Fukunaga & Koontz method will find the correct 
feature vectors. Table 3.4 shows the classification accuracies. Decision 
Boundary Feature Extraction finds the two effective feature vectors, achieving 
the same classification accuracy as in the original space. 
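
The rank-2 result of Example 3.5 can also be seen from the exact statistics, without generating class samples. The sketch below is illustrative: it uses the means and covariances of Example 3.5, takes the threshold as t = 0 (so the boundary is the cylinder $x_1^2 + x_2^2 = 3 \ln 3$), and places boundary points at random angles and heights; the sample count and seed are arbitrary choices.

```python
import numpy as np

Sigma1 = np.diag([3.0, 3.0, 1.0])
Sigma2 = np.eye(3)
A = np.linalg.inv(Sigma1) - np.linalg.inv(Sigma2)   # Sigma1^{-1} - Sigma2^{-1}

# With zero means and t = 0, h(X) = 0 gives x1^2 + x2^2 = 3 ln 3.
radius = np.sqrt(3.0 * np.log(3.0))

rng = np.random.default_rng(0)
count = 5000
theta = rng.uniform(0.0, 2.0 * np.pi, count)
height = rng.normal(size=count)
points = np.stack([radius * np.cos(theta),
                   radius * np.sin(theta),
                   height], axis=1)

# Gradient of h at each boundary point; the mean terms vanish for zero means.
normals = points @ A.T
normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)

dbfm = normals.T @ normals / count
eigvals = np.linalg.eigvalsh(dbfm)                  # ascending order
print("eigenvalues:", eigvals)
```

One eigenvalue is zero because every normal is perpendicular to the cylinder axis, while the other two are close to one half; the small third eigenvalue reported above comes from estimating the statistics from 200 samples per class.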

Table 3.4 Classification accuracies of Decision Boundary Feature Extraction 
and the Foley & Sammon method of Example 3.5. 

No. Features   Decision Boundary          Foley & Sammon
               Feature Extraction (%)     Method (%)
1              65.0                       62.3
2              70.0                       60.5
3              70.0                       70.0


From the experiments with generated data for given statistics, it is noted 
that the proposed feature extraction algorithm based on the decision boundary 
performs well even if there is no mean difference (Examples 3.4, 3.5) or no 
covariance difference (Example 3.3) without any deterioration. On the other 
hand, the Foley & Sammon method fails if there is no mean difference 
(Examples 3.4, 3.5) and the Fukunaga & Koontz method would fail if there is no 
covariance difference (Example 3.3) or significant mean difference (Foley and 
Sammon 1975). In Chapter 5, the decision boundary feature extraction 
algorithm is applied to a 3-class problem (generated data). 


3.6.2 Experiments with real data 
3.6.2.1 FSS Data and Preprocessing 


In the following experiments, tests are conducted using multispectral data 
which was collected as a part of the LACIE remote sensing program (Biehl et al. 
1982) and major parameters are shown in Table 3.5. 

Table 3.5 Parameters of Field Spectrometer System. 

Number of Bands      60
Spectral Coverage    0.4 - 2.4 µm
Altitude             60 m
IFOV (ground)        25 m


If estimation of statistics is not accurate, using more features does not 
necessarily increase classification accuracy. The so-called Hughes 
phenomenon occurs in practice when the number of training samples is not 
enough for the number of features (Swain and Davis 1978). Figure 3.16 shows 
a graph of classification accuracy vs. number of features. There are 6 classes 
and Table 8.1 in Chapter 8 provides information on the 6 classes. The original 
60 dimensional data are reduced to feature sets of different sizes using a 
simple band combination procedure which will be referred to as Uniform Feature 
Design. For example, if the number of features is to be reduced from 60 to 30, 
every two consecutive bands are combined to form a new feature. In other 
words, the i-th feature of a new feature set is given by 

Y_i = X_{2i-1} + X_{2i} 

Where the number of features desired is not evenly divisible into 60, the nearest 
integer number of bands is used. For example, for 9 features, the first 6 original 
bands were combined to create the first feature, then the next 7 bands were 
combined to create the next feature, and so on. 
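
The band combination described above can be sketched as follows. This is a minimal illustration rather than the original implementation: the function name `uniform_feature_design` and the floor-based choice of group boundaries are assumptions, picked so that 9 features from 60 bands begin with a group of 6 bands followed by a group of 7, as in the example in the text. Bands within a group are summed, following the two-band formula.

```python
import numpy as np

def uniform_feature_design(X, n_features):
    """Sum consecutive bands of X (n_samples x n_bands) into n_features groups.

    Group boundaries are floor(k * n_bands / n_features). This rounding rule
    is an assumption; the text only says the nearest integer number of bands
    is used.
    """
    X = np.asarray(X, dtype=float)
    n_bands = X.shape[1]
    starts = [(k * n_bands) // n_features for k in range(n_features)]
    # np.add.reduceat sums the columns from each start index up to the next one.
    Y = np.add.reduceat(X, starts, axis=1)
    sizes = np.diff(starts + [n_bands])   # number of bands in each group
    return Y, sizes

X = np.ones((5, 60))                      # 5 dummy samples, 60 bands
Y, sizes = uniform_feature_design(X, 9)
print(Y.shape, sizes.tolist())
```

With all-ones input each new feature equals the number of bands in its group, which makes the grouping easy to inspect.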

In the test, 100 training samples are used to estimate the statistics and 
the rest are used for test data. As can be seen, the classification accuracy 
peaked at about 29 features. After 29 features, adding more features decreases 
the classification accuracy. In fact, the classification accuracy is saturated at 
about 17-20 features. As a result, in the following experiments using the FSS 
data, the original 60 dimensional data are reduced to 17-20 dimensional data 
using Uniform Feature Design. Then various feature extraction/selection 
methods are applied to the reduced data set. 

Figure 3.16 Classification accuracy vs. number of features. 


3.6.2.2 Experiments and Results 


Along with the proposed Decision Boundary Feature Extraction, five 
other feature selection/extraction algorithms, Uniform Feature Design, Principal 
Component Analysis (the Karhunen-Loeve transformation) (Richards 1986), 
Canonical Analysis (Richards 1986), feature selection using a statistical 
distance measure, and the Foley & Sammon method (Foley and Sammon 
1975) are tested to evaluate and compare the performance of the proposed 
algorithm. In the feature selection using a statistical distance measure, 
Bhattacharyya distance (Fukunaga 1990) is used. Feature selection using the 
statistical distance measure will be referred to as Statistical Separability. The 
Foley & Sammon method is based on the generalized Fisher criterion (Foley 
and Sammon 1975). For a two class problem, the Foley & Sammon method is 
used for comparison. If there are more than 2 classes, Canonical Analysis is 
used for comparison. 

In the following test, two classes are chosen from the data collected at 
Finney Co. KS. on May 3, 1977. Table 3.6 shows the number of samples in each 
of the two classes. In this test, the covariance matrices and mean vectors are 
estimated using 400 randomly chosen samples from each class and the rest of 
the data are used for test. Figure 3.17 shows the mean graph of the two classes. 
There is a relatively large difference in the mean vectors between the two 
classes. 


Table 3.6 Class description of data collected at Finney Co. KS. 

SPECIES          No. of Samples
WINTER WHEAT     691
UNKNOWN CROPS    619



Figure 3.17 Mean graph of the two classes of Table 3.6. 

Figure 3.18 shows the performance comparison on test data of the 5 
feature selection/extraction algorithms for different numbers of features. With 20 
features, the classification accuracy is about 94.1%. Decision Boundary Feature 
Extraction and the Foley & Sammon method achieve approximately the 
maximum classification accuracy with just one feature while the other feature 
selection/extraction algorithms need 7-8 features to achieve about the same 
classification accuracy. 

Table 3.7 shows the eigenvalues of the decision boundary feature matrix 
along with proportions and accumulations. The eigenvalues are sorted in 
decreasing order. The classification accuracies obtained using the 
corresponding eigenvectors are also shown along with the normalized 
classification accuracies obtained by dividing the classification accuracies by 
the classification accuracy obtained using all features. The rank of the decision 
boundary feature matrix (Σ_DBFM) must be decided. Although it is relatively easy to 
decide the rank for low dimensional generated data, it becomes less obvious for 
high dimensional real data. One may add eigenvalues until the accumulation 
exceeds 95% of the total sum and set that number of eigenvalues as the 
rank of Σ_DBFM. Defined in this way, the rank of Σ_DBFM would be 5. 
Alternatively, one may retain the eigenvalues greater than one tenth of the 
largest eigenvalue. In this way, the rank of Σ_DBFM would be 4. This problem 
is discussed further below. 
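
The two rank-selection rules above can be written out in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the proportion column of Table 3.7 is used as input in place of the raw eigenvalues, which is harmless because both rules are unaffected by a common scale factor.

```python
import numpy as np

def rank_by_accumulation(eigvals, threshold=0.95):
    """Smallest number of leading eigenvalues whose share of the sum exceeds threshold."""
    ev = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    acc = np.cumsum(ev) / ev.sum()
    return int(np.argmax(acc > threshold)) + 1

def rank_by_relative_size(eigvals, fraction=0.1):
    """Number of eigenvalues larger than fraction times the largest eigenvalue."""
    ev = np.asarray(eigvals, dtype=float)
    return int(np.sum(ev > fraction * ev.max()))

# Proportion column of Table 3.7 (percent), 20 entries.
proportions = [49.6, 27.3, 8.3, 6.6, 3.3, 2.1, 1.0, 0.6, 0.4, 0.3,
               0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(rank_by_accumulation(proportions))    # 5
print(rank_by_relative_size(proportions))   # 4
```

Both thresholds (95% and one tenth) are the values quoted in the text; they are exposed as parameters because the text notes that the choice is not obvious for high dimensional real data.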


Table 3.7 Eigenvalues of the Decision Boundary Feature Matrix of the 2 
classes of Table 3.6 along with proportions and accumulations. 
Ev.: Eigenvalue, Pro. Ev.: Proportion of Eigenvalue, Acc. Ev.: 
Accumulation of Eigenvalues, Cl. Ac.: Classification Accuracy, 
N. Cl. Ac.: Normalized Classification Accuracy. 

      Ev.     Pro. Ev. (%)   Acc. Ev. (%)   Cl. Ac. (%)   N. Cl. Ac. (%)
1     0.994   49.6           49.6           93.4          97.9
2     0.547   27.3           77.0           94.3          98.8
3     0.167   8.3            85.3           94.4          99.0
4     0.133   6.6            91.9           95.0          99.6
5     0.066   3.3            95.2           95.1          99.7
6     0.041   2.1            97.3           94.9          99.5
7     0.020   1.0            98.3           94.9          99.5
8     0.012   0.6            98.8           94.8          99.4
9     0.008   0.4            99.2           95.0          99.6
10    0.007   0.3            99.6           95.3          99.9
11    0.005   0.2            99.8           95.3          99.9
12    -       0.1            99.9           95.7          100.3
13    0.001   0.0            99.9           95.5          100.1
14    0.001   0.0            100.0          95.4          100.0
15    0.000   0.0            100.0          95.3          99.9
16    0.000   0.0            100.0          95.6          100.2
17    0.000   0.0            100.0          95.5          100.1
18    0.000   0.0            100.0          95.5          100.1
19    0.000   0.0            100.0          95.4          100.0
20    0.000   0.0            100.0          95.4          100.0


In the following test, two classes are chosen from the data collected at 
Hand Co. SD. on May 15, 1978. Table 3.8 shows the number of samples in 
each of the two classes. Figure 3.19 shows the mean graph of the two classes. 
As can be seen, the mean differences are relatively small. In this test, all data 
are used for training and test since the number of available samples is very 
limited. 


Table 3.8 Class description of data collected at Hand Co. SD. 

SPECIES         No. of Samples
WINTER WHEAT    223
SPRING WHEAT    474



Figure 3.20 shows the performance comparison of the 5 feature 
selection/extraction algorithms for different numbers of features. With 20 
features, the classification accuracy is 91.1%. In this case, the Foley & Sammon 
method performs less well due to the small class mean difference. Statistical 
Separability performs similarly. However, Decision Boundary Feature Extraction 
outperforms all other methods. Decision Boundary Feature Extraction achieves 
approximately 90% classification accuracy with 9 features while the other 
feature selection/extraction algorithms need 15-18 features to achieve 90% 
classification accuracy. 

Figure 3.20 Performance comparison of Uniform Feature Design, Principal 
Component Analysis, Canonical Analysis, Statistical 
Separability, and Decision Boundary Feature Extraction. 

In the following test, 4 classes are chosen from the FSS data. Table 3.9 
provides data on the 4 classes. Figure 3.21 shows the mean graph of the 4 
classes. In this test, 300 randomly selected samples are used for training and 
the rest are used for test. 

Table 3.9 Class description. 

SPECIES          DATE             No. of Samples
Winter Wheat     May 3, 1977      657
Unknown Crops    May 3, 1977      678
Winter Wheat     March 8, 1977    691
Unknown Crops    March 8, 1977    619



[Plot legend: Winter Wheat, May 3 1977; Unknown Crops, May 3 1977; Winter 
Wheat, March 8 1977; Unknown Crops, March 8 1977. x-axis: Spectral Bands, 0 to 60] 

Figure 3.21 Mean graph of the 4 classes of Table 3.9. 


Figure 3.22 shows the performance comparison of the 5 feature 
selection/extraction algorithms for different numbers of features. Decision 
Boundary Feature Extraction achieves approximately 90% classification 
accuracy with 3 features while Canonical Analysis achieves about 87.5% 
classification accuracy with 3 features. On the other hand, Statistical 
Separability achieves about 87.5% with 5 features. Both Uniform Feature 
Design and Principal Component Analysis perform poorly. 

Figure 3.22 Performance comparison of Uniform Feature Design, Principal 
Component Analysis, Canonical Analysis, Statistical 
Separability, and Decision Boundary Feature Extraction. 


In the following test, 4 classes are chosen from the data collected at 
Hand Co. SD. on May 15, 1978. Table 3.10 shows the number of samples in 
each of the 4 classes. Figure 3.23 shows the mean graph of the 4 classes. As 
can be seen, the mean difference is relatively small among some classes. In 
this test, all data are used for training and test. 

Table 3.10 Class description. 

Species             Date            No. of Samples
Winter Wheat        May 15, 1978    223
Native Grass Pas    May 15, 1978    196
Oats                May 15, 1978    163
Unknown Crops       May 15, 1978    253



[Plot legend: Winter Wheat; Native Grass Pas; Oats; Unknown Crops. x-axis: 
Spectral Band] 

Figure 3.23 Mean graph of the 4 classes of Table 3.10. 


Figure 3.24 shows the performance comparison of the 5 feature 
selection/extraction algorithms for different numbers of features. The 
classification accuracy using 20 features is about 88%. In this case, Canonical 
Analysis performs less well since class mean differences are relatively small. 
The performance of Decision Boundary Feature Extraction is much better than 
those of the other methods. Decision Boundary Feature Extraction achieves 
approximately 87.5% classification accuracy with 11 features while the other 
methods need 17-20 features to achieve about the same classification 
accuracies. 

Figure 3.24 Performance comparison of Uniform Feature Design, Principal 
Component Analysis, Canonical Analysis, Statistical 
Separability, and Decision Boundary Feature Extraction. 

In the following test, 4 classes are chosen from the data collected at 
Hand Co. SD. on August 16, 1978. Table 3.11 shows the number of samples in 
each of the 4 classes. Figure 3.25 shows the mean graph of the 4 classes. In 
this test, 100 randomly selected samples are used for training and the rest are 
used for test. 

Table 3.11 Class description. 

SPECIES             DATE               No. of Samples
Other Crops         August 16, 1978    199
Native Grass Pas    August 16, 1978    212
Oats                August 16, 1978    165
Summer Fallow       August 16, 1978    216



[Plot legend: Other Crops; Native Grass Pas; Oats; Summer Fallow. x-axis: 
Spectral Band, 0 to 60] 

Figure 3.25 Mean graph of the 4 classes of Table 3.11. 

Figure 3.26 shows the performance comparison of the 5 feature 
selection/extraction algorithms for different numbers of features. Decision 
Boundary Feature Extraction achieves 95% classification accuracy with 4 
features while the classification accuracy of Canonical Analysis with 3 features 
is 93%. The performances of Uniform Feature Design and Principal Component 
Analysis are poor compared with the other methods. 

Figure 3.26 Performance comparison of Uniform Feature Design, Principal 
Component Analysis, Canonical Analysis, Statistical Separability, 
and Decision Boundary Feature Extraction. 


In the following test, 6 classes are chosen from the FSS data. Table 3.12 
provides information on the 6 classes. Figure 3.27 shows the mean graph of the 
6 classes. In this test, 300 samples are used for training and the rest are used 
for test. 

Table 3.12 Class description of the multi-temporal 6 classes. 

Date      Location         Species          No. Sample
770308    Finney Co. KS    Winter Wheat     691
770626    Finney Co. KS    Winter Wheat     677
771018    Hand Co. SD      Winter Wheat     662
770503    Finney Co. KS    Winter Wheat     658
770626    Finney Co. KS    Summer Fallow    643
780726    Hand Co. SD      Spring Wheat     511



[Plot legend: Winter Wheat 770308; Winter Wheat 770626; Winter Wheat 771018; 
Winter Wheat 770503; Summer Fallow; Spring Wheat. x-axis: Spectral Band] 

Figure 3.27 Mean graph of the 6 classes of Table 3.12. 


Figure 3.28 shows the performance comparison of the 5 feature 
selection/extraction algorithms for different numbers of features. The 
classification accuracy using all features is 96.2%. Decision Boundary Feature 
Extraction achieves 94.2% classification accuracy with 5 features while the 
classification accuracy of Canonical Analysis with 5 features is 92.2%. 
Statistical Separability needs 11 features to achieve 94.2%. The performances 
of Uniform Feature Design and Principal Component Analysis are poor. 

Figure 3.28 Performance comparison of Uniform Feature Design, Principal 
Component Analysis, Canonical Analysis, Statistical Separability, 
and Decision Boundary Feature Extraction. 

In the following test, 12 classes are chosen from the FSS data. Table 
3.13 shows the number of samples in each of the 12 classes. The data is 
multi-temporal. 

Table 3.13 Class description of the multi-temporal 12 classes. 

Date      Location         Species          No. Sample
770308    Finney Co. KS    Winter Wheat     691
770626    Finney Co. KS    Winter Wheat     677
771018    Hand Co. SD      Winter Wheat     662
770503    Finney Co. KS    Winter Wheat     658
770626    Finney Co. KS    Summer Fallow    643
780726    Hand Co. SD      Spring Wheat     518
780602    Hand Co. SD      Spring Wheat     517
780515    Hand Co. SD      Spring Wheat     474
780921    Hand Co. SD      Spring Wheat     469
780816    Hand Co. SD      Spring Wheat     464
780709    Hand Co. SD      Spring Wheat     454
781026    Hand Co. SD      Spring Wheat     441



Figure 3.29 Performance comparison, 12 pattern classes. 

Figure 3.29 shows the performance comparison of the 5 feature 
selection/extraction algorithms for different numbers of features. In this case, 
Decision Boundary Feature Extraction and Canonical Analysis show 
comparable performances, although Decision Boundary Feature Extraction 
shows a slightly better performance than Canonical Analysis when more than 8 
features are used. Statistical Separability shows a relatively good performance. 
It is noted that, as more features are used, the performances of the 5 feature 
selection/extraction algorithms continue to improve. 

In the next test, 40 classes are chosen from the FSS data. Table 3.14 
provides information on the 40 classes. The data is multi-temporal. Figure 3.30 
shows the performance comparison of the 5 feature selection/extraction 
algorithms for different numbers of features. In this case, Canonical Analysis, 
Statistical Separability and Decision Boundary Feature Extraction show 
essentially equivalent performances. In addition, as more features are used, the 
classification accuracies of the 5 feature selection/extraction algorithms 
continue to improve, suggesting that, for a large number of classes, a large 
number of features are also needed to discriminate between classes. For such a 
large number of classes, the fast classification algorithm in Chapter 2 can be 
employed. 

Table 3.14 Class description of the multi-temporal 40 classes. Entries are 
listed column by column in their original order; not every entry is legible. 

Date: 770308, 770626, 771018, 770503, 770626, 780726, 780602, 780515, 780921, 
780816, 780709, 781026, 760928, 781026, 771018, 770920, 780921, 770308, 
760928, 780921, 780515, 780726, 781026, 780816, 780602, 780816, 770503, 
780726, 780515, 780709, 771018, 780921, 780726, 780709, 780816, 780515, 
780709, 771018, 770626 

Location: Finney Co. KS or Hand Co. SD for each class 

Species: Winter Wheat, Winter Wheat, Winter Wheat, Winter Wheat, Summer Fallow, 
Spring Wheat, Spring Wheat, Spring Wheat, Spring Wheat, Spring Wheat, 
Spring Wheat, Spring Wheat, Summer Fallow, Winter Wheat, Spring Wheat, 
Winter Wheat, Winter Wheat, Grain Sorghum, Grain Sorghum, Oats, Pasture, 
Winter Wheat, Native Grass Pas, Pasture, Summer Fallow, Native Grass Pas, 
Summer Fallow, Summer Fallow, Summer Fallow, Native Grass Pas, Oats, Oats, 
Oats, Oats, Oats, Oats, Grain Sorghum 

No. of Data: 691, 677, 662, 658, 643, 518, 517, 474, 469, 464, 454, 441, 414, 
393, 313, 292, 292, 279, 277, 259, 225, 223, 217, 217, 216, 214, 212 

Figure 3.30 Performance comparison, 40 pattern classes. 

3.6.3 Eigenvalues of Decision Boundary Feature Matrix and Classification 
Accuracy 

Theoretically, the eigenvectors of the decision boundary feature matrix 
corresponding to non-zero eigenvalues will contribute to improvement of 
classification accuracy. However, in practice, a threshold must be set to 
determine the effectiveness of eigenvectors by the corresponding eigenvalues, 
especially for high dimensional real data. Figure 3.31 shows the relationship 
between the accumulation of eigenvalues of the decision boundary feature 
matrix and the normalized classification accuracies obtained by dividing the 
classification accuracies by the classification accuracy obtained using all 
features. There is a nearly linear relationship between normalized classification 
accuracy and accumulation of eigenvalues up to x=95 where x is the 
accumulation of eigenvalues. As the accumulation of eigenvalues approaches 
100 percent, the linear relationship between the normalized classification 
accuracy and the accumulation of eigenvalues does not hold; care must be 
taken to set the threshold. More experiments are needed to obtain a better 
understanding of the relationship between the normalized classification 
accuracy and the accumulation of eigenvalues. 



Figure 3.31 Relationship between the normalized classification accuracy 
(see text) and the accumulation of eigenvalues (x, in percent). 


3.6.4 Decision Boundary Feature Extraction Method and the Foley & Sammon 
Method in High Dimensional Space 

The Foley & Sammon method will find an optimum feature set if there is a 
reasonable class mean difference. However, the Foley & Sammon method fails 
if the class mean differences are small. Another problem with the Foley & 
Sammon method is that it does not take full advantage of information contained 
in the second order statistics. In the Foley & Sammon method (Foley and 
Sammon 1975), a new feature vector d is found to maximize R(d) 

R(d) = \frac{(d^t \Delta)^2}{d^t A d} 

where d = column vector on which the data are projected; 
Δ = M_1 - M_2, and M_i is the estimated mean of class ω_i; 
A = cΣ_1 + (1 - c)Σ_2 with 0 < c < 1, and Σ_i is the estimated covariance of class ω_i. 

By using the lumped covariance A in the criterion, the Foley & Sammon method 
may lose some information contained in the difference of the class covariances. 
In a high dimensional space, information contained in the second order 
statistics plays a significant role in discriminating between classes as shown in 
Chapter 7. 
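
The dependence of the criterion on the mean difference can be checked numerically. The sketch below is an illustration under stated assumptions, not the Foley & Sammon procedure in full: it computes only the first direction, which for a ratio of this form is proportional to A^{-1}(M_1 - M_2), and the class statistics and function names are hypothetical. The later Foley & Sammon vectors, which are found under orthogonality constraints, are not computed.

```python
import numpy as np

def fisher_ratio(d, delta, A):
    """R(d) = (d' delta)^2 / (d' A d)."""
    return float((d @ delta) ** 2 / (d @ A @ d))

def first_direction(m1, m2, S1, S2, c=0.5):
    """Unit vector maximizing R(d); it is proportional to A^{-1} (m1 - m2)."""
    A = c * S1 + (1.0 - c) * S2          # lumped covariance
    delta = m1 - m2
    d = np.linalg.solve(A, delta)
    return d / np.linalg.norm(d), delta, A

# Hypothetical two-class statistics in 3 dimensions.
m1, m2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.5, 0.0])
S1, S2 = np.diag([1.0, 2.0, 3.0]), np.diag([2.0, 1.0, 0.5])
d, delta, A = first_direction(m1, m2, S1, S2)

# No randomly drawn direction should beat the closed-form maximizer.
rng = np.random.default_rng(0)
others = rng.normal(size=(1000, 3))
best_other = max(fisher_ratio(v, delta, A) for v in others)
print(fisher_ratio(d, delta, A) >= best_other)

# With equal means the numerator vanishes, so R(d) = 0 for every d and the
# criterion cannot rank directions even though the covariances differ.
print(fisher_ratio(others[0], np.zeros(3), A))
```

The second print illustrates the point made in the text: the covariance difference between the classes never enters R(d) except through the lumped matrix A.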



Figure 3.32 Performance comparison of the Gaussian ML classifier, the 
Gaussian ML classifier with zero mean, and the minimum 
distance classifier, tested on 40 multi-temporal classes. 

Figure 3.32 shows an example. Three classifiers are tested on different 
numbers of features. The first classifier is the Gaussian ML classifier which 
utilizes both class mean and class covariance information. In the second test, 
the mean vectors of all classes were made zero and the Gaussian ML classifier 
was applied to the zero mean data. In other words, the second classifier, which 
is a Gaussian ML classifier, is constrained to use only covariance differences 
among classes. The third classifier is a conventional minimum distance 
classifier (Richards 1986) which utilizes only the first order statistics. It is 
noteworthy that the classifier using only first order statistics outperformed that 
using only second order statistics when the dimensionality was low. However, 
saturation soon set in, while performance of the classifier using only covariance 
information improved as more features were used. The implication seems to be 
that at low dimensionality the relative location of class distributions in feature 
space dominates in importance, but at higher dimensionality, the relative shape 
of the distribution dominates and in the long run is more significant to class 
separation. 
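
The contrast between first order and second order information can be reproduced on synthetic data. The following is a minimal sketch, assuming equal priors and two hypothetical zero-mean Gaussian classes that differ only in variance; it is not the experiment of Figure 3.32, and accuracies are computed by resubstitution for brevity.

```python
import numpy as np

def gaussian_ml_predict(X, means, covs):
    """Assign each row of X to the class with the largest Gaussian log-likelihood."""
    scores = []
    for m, S in zip(means, covs):
        diff = X - m
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(S), diff)
        scores.append(-0.5 * maha - 0.5 * np.linalg.slogdet(S)[1])
    return np.argmax(np.array(scores), axis=0)

def min_distance_predict(X, means):
    """Assign each row of X to the class with the nearest mean (Euclidean)."""
    dists = [np.sum((X - m) ** 2, axis=1) for m in means]
    return np.argmin(np.array(dists), axis=0)

rng = np.random.default_rng(0)
n, dim = 500, 10
X0 = rng.normal(scale=1.0, size=(n, dim))   # class 0: zero mean, unit variance
X1 = rng.normal(scale=3.0, size=(n, dim))   # class 1: zero mean, larger variance
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

means = [X0.mean(axis=0), X1.mean(axis=0)]
covs = [np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)]

acc_ml = np.mean(gaussian_ml_predict(X, means, covs) == y)
# Zero-mean variant: the same classifier with all class means forced to zero.
acc_zero_mean = np.mean(gaussian_ml_predict(X, [np.zeros(dim)] * 2, covs) == y)
acc_md = np.mean(min_distance_predict(X, means) == y)
print(acc_ml, acc_zero_mean, acc_md)
```

Because the two classes share the same mean, the minimum distance classifier stays near chance while both Gaussian ML variants separate the classes through the covariance difference alone.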

In order to evaluate the performances of the Foley & Sammon method 
and Decision Boundary Feature Extraction for various mean differences in high 
dimensional space, the following test is done. Two classes are selected from 
FSS data. Table 3.15 shows the data on the two classes. In this test, all data are 
used for training and test. 


Table 3.15 Class description. 

SPECIES            No. of Samples    Date
1: WINTER WHEAT    658               May 3, 1977
2: WINTER WHEAT    393               Oct. 26, 1978


In the test, the mean of one class is moved relative to the mean of the other 
class, and the performances of the Foley & Sammon method and Decision 
Boundary Feature Extraction are evaluated for various mean differences 
(0.5σ ≤ Δ = |M_1 - M_2| ≤ 5σ), where σ is the average standard deviation, i.e., 

\sigma = \frac{1}{2N} \sum_{i=1}^{2} \sum_{j=1}^{N} \sigma_j^i 

where N is the number of features and σ_j^i is the standard deviation of the 
j-th feature of class ω_i. 
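
As an illustration of this setup, the sketch below computes the average standard deviation from two covariance matrices and moves one class mean so that the mean difference equals a chosen multiple of it. The text does not say in which direction the mean was moved; shifting along the line joining the two original means is an assumption, and the statistics and function names are hypothetical.

```python
import numpy as np

def average_std(S1, S2):
    """Average of the per-feature standard deviations over both classes."""
    stds = np.sqrt(np.concatenate([np.diag(S1), np.diag(S2)]))
    return float(stds.mean())

def shift_mean(m1, m2, S1, S2, k):
    """Move m2 along the line through m1 and m2 so that |m1 - m2| = k * sigma."""
    sigma = average_std(S1, S2)
    direction = (m2 - m1) / np.linalg.norm(m2 - m1)
    return m1 + k * sigma * direction

# Hypothetical statistics for two classes with N = 3 features.
m1, m2 = np.zeros(3), np.array([3.0, 4.0, 0.0])
S1, S2 = np.diag([1.0, 4.0, 9.0]), np.diag([4.0, 1.0, 1.0])
sigma = average_std(S1, S2)               # (1 + 2 + 3 + 2 + 1 + 1) / 6
m2_new = shift_mean(m1, m2, S1, S2, k=0.5)
print(sigma, np.linalg.norm(m1 - m2_new))
```

Only the mean is changed; the covariance matrices are left as they are, so any remaining separability at small k comes from the second order statistics.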

Figures 3.33 and 3.34 show the performances of the Foley & Sammon method 
and Decision Boundary Feature Extraction for various mean differences. First, it 
is noted that even when there is a small mean difference (Δ = 0.5σ), classification 
accuracy can be almost 100%. This shows again that information contained in 
the second order statistics plays a significant role in discriminating between 
classes in high dimensional space. As can be seen in Figure 3.33, the Foley & 
Sammon method fails to find a good feature set if the mean differences are 
relatively small (Δ ≤ 2.5σ). Once there are sufficient mean differences (Δ ≥ 3σ), the 
Foley & Sammon method begins to find a good feature set. On the other 
hand, Decision Boundary Feature Extraction works well even when the mean 
differences are small and finds a good feature set utilizing the covariance 
differences as can be seen in Figure 3.34. 



Figure 3.33 Performance comparison of the Foley & Sammon method for 
various mean differences. 

Figure 3.34 Performance comparison of Decision Boundary Feature 
Extraction for various mean differences. 


Table 3.16 shows the number of features needed to achieve over 97% of 
the classification accuracy obtained using all features for the various mean 
differences. If Δ ≥ 4σ, both methods find a proper feature set, achieving over 
97% of the classification accuracy obtained using all features with one feature. 
For 3σ ≤ Δ ≤ 3.5σ, Decision Boundary Feature Extraction achieves over 97% of 
the classification accuracy obtained using all features with just one feature 
while the Foley & Sammon method needs 4-5 features. When Δ ≤ 2.5σ, the Foley 
& Sammon method performs poorly while Decision Boundary Feature 
Extraction achieves over 97% of the classification accuracy obtained using all 
features with 2, 3, 5, 4, and 5 features for Δ = 2.5σ, 2σ, 1.5σ, σ, and 0.5σ, 
respectively. 

Table 3.16 Number of features needed to achieve over 97% of the 
classification accuracy obtained using all features for the various 
mean differences. σ = average standard deviation. 

Mean Difference      0.5σ   1σ   1.5σ   2σ   2.5σ   3σ   3.5σ   4σ   4.5σ   5σ
Foley & Sammon       19     18   14     11   9      5    4      1    1      1
Decision Boundary    5      4    5      3    2      1    1      1    1      1
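
The quantity tabulated here, the number of features needed to exceed a given fraction of the all-feature accuracy, is simple to compute from an accuracy curve. The sketch below is illustrative; the function name is hypothetical, and because the accuracy curves behind Table 3.16 are not listed in the text, the classification accuracy column of Table 3.7 is used as sample input.

```python
def features_needed(accuracies, fraction=0.97):
    """Smallest number of features whose accuracy exceeds fraction * (accuracy with all features).

    accuracies[k] is the accuracy obtained with the first k + 1 features.
    """
    target = fraction * accuracies[-1]
    for k, acc in enumerate(accuracies, start=1):
        if acc > target:
            return k
    return len(accuracies)

# Classification accuracies (%) from Table 3.7, for 1 to 20 features.
cl_ac = [93.4, 94.3, 94.4, 95.0, 95.1, 94.9, 94.9, 94.8, 95.0, 95.3,
         95.3, 95.7, 95.5, 95.4, 95.3, 95.6, 95.5, 95.5, 95.4, 95.4]
print(features_needed(cl_ac))          # 1
print(features_needed(cl_ac, 0.995))   # 4
```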


3.7 Conclusion 

We have proposed a new approach to feature extraction for classification 
based on decision boundaries. We defined discriminantly redundant features 
and discriminantly informative features for the sake of feature extraction for 
classification and showed that the discriminantly redundant features and the 
discriminantly informative features are related to the decision boundary. By 
recognizing that normal vectors to the decision boundary are discriminantly 
informative, the decision boundary feature matrix was defined using the normal 
vectors. It was shown that the rank of the decision boundary feature matrix is 
equal to the intrinsic discriminant dimension, and the eigenvectors of the 
decision boundary feature matrix corresponding to non-zero eigenvalues are 
discriminantly informative. We then proposed a procedure to calculate 
empirically the decision boundary feature matrix. 

Except for some special cases, the rank of decision boundary feature 
matrix would be the same as the original dimension. However, it was noted that 
in many cases only a small portion of the decision boundary is effective in 
discriminating among pattern classes, and it was shown that it is possible to 
reduce the number of features by utilizing the effective decision boundary rather 
than the complete boundary. 

The proposed feature extraction algorithm based on the decision 
boundary has several desirable properties. The performance of the proposed 
algorithm does not deteriorate even when there is little or no mean difference or 
covariance difference. In addition, the proposed algorithm predicts the minimum 
number of features required to achieve the same classification accuracy as in 
the original space for a given problem. Experiments show that the proposed 
feature extraction algorithm finds the right feature vectors even in cases where 
some previous algorithms fail to find them, and the performance of the proposed 
algorithm compares favorably with that of several previous algorithms. 

Developments with regard to sensors for Earth observation are moving in 
the direction of providing much higher dimensional multispectral imagery than 
is now possible. The HIRIS instrument now under development for the Earth 
Observing System (EOS), for example, will generate image data in 192 spectral 
bands simultaneously. In order to analyze data of this type, new techniques for 
all aspects of data analysis will no doubt be required. The proposed algorithm 
provides such a new and promising approach to feature extraction for 
classification of such high dimensional data. 

Even though the experiments are conducted using Gaussianly 
distributed data or assuming a Gaussian distribution, all the developed 
theorems hold for other distributions and other decision rules as well. In 
addition, it will be shown in the next chapter how the proposed algorithm can 
also be applied to non-parametric classifiers if the decision boundary can be 
found numerically. 


CHAPTER 4 DECISION BOUNDARY FEATURE EXTRACTION FOR NON-PARAMETRIC 
CLASSIFICATION 


4.1 Introduction 

Although many authors have studied feature extraction for parametric 
classifiers (Decell and Guseman 1979), relatively few algorithms are available 
for non-parametric classifiers. The lack of practical feature extraction algorithms 
for the non-parametric classifier is mainly due to the nature of a non-parametric 
classifier. Without an assumption about the underlying density functions, feature 
extraction for non-parametric classifiers is often infeasible in practice or very 
time consuming. 

Some general feature extraction methods could be used for non- 
parametric classifiers. Muasher and Landgrebe (1983) proposed a method to 
base feature extraction on the statistics of the whole data set. Although this is 
not optimal in a theoretical sense, it can be used even when underlying class 
densities are unknown, or precise estimates of them are not possible. In 
addition, such methods can be used for both parametric and non-parametric 
classifiers. Since, in many cases, it may be difficult to obtain enough training 
samples, feature extraction methods based on the whole data set may be a 
good and useful solution. 

In discriminant analysis (Fukunaga 1990), a within-class scatter matrix 
Σ_w and a between-class scatter matrix Σ_b are used to formulate a criterion 
function. A typical criterion is 

J_1 = \mathrm{tr}(\Sigma_w^{-1} \Sigma_b)    (4.1) 

where Σ_w is the within-class scatter matrix and Σ_b is the between-class scatter 
matrix as defined in Section 3.2. New feature vectors are selected to maximize 
the criterion. Fukunaga proposed a non-parametric discriminant analysis which 
is based on non-parametric extensions of commonly used scatter matrices 
(Fukunaga and Mantock 1983). Patrick proposed a non-parametric feature 
extraction process where a non-quadratic distance function defined between 
classes is used to define the best linear subspace (Patrick and Fischer II 1969). 
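
For reference, criterion (4.1) can be evaluated directly from class statistics. The sketch below assumes the usual parametric definitions, with the within-class scatter taken as the prior-weighted average of the class covariances and the between-class scatter built from the class means about the overall mean; Section 3.2 is not reproduced here, so these definitions, the function names, and the sample statistics should all be read as assumptions.

```python
import numpy as np

def scatter_matrices(means, covs, priors):
    """Within-class and between-class scatter matrices from class statistics."""
    means = np.asarray(means, dtype=float)
    priors = np.asarray(priors, dtype=float)
    m0 = priors @ means                                   # overall mean
    Sw = sum(p * np.asarray(S) for p, S in zip(priors, covs))
    Sb = sum(p * np.outer(m - m0, m - m0) for p, m in zip(priors, means))
    return Sw, Sb

def j1(Sw, Sb):
    """J_1 = tr(Sw^{-1} Sb)."""
    return float(np.trace(np.linalg.solve(Sw, Sb)))

# Hypothetical statistics for three classes in two dimensions.
means = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
covs = [np.eye(2), np.diag([2.0, 1.0]), np.diag([1.0, 0.5])]
priors = [1 / 3, 1 / 3, 1 / 3]
Sw, Sb = scatter_matrices(means, covs, priors)
print(j1(Sw, Sb))

# J_1 is unchanged by an invertible linear map y = T x.
T = np.array([[2.0, 1.0], [0.0, 1.0]])
print(np.isclose(j1(T @ Sw @ T.T, T @ Sb @ T.T), j1(Sw, Sb)))
```

Note that the between-class scatter depends only on the class means, so a criterion of this kind carries no information when the means coincide.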

Features can be selected under a criterion which is related to the 
probability of error. The Bhattacharyya distance is a measure of statistical 
separability and is defined as follows (Fukunaga 1990): 

\mu(1/2) = -\ln \int [p(X|\omega_1) p(X|\omega_2)]^{1/2} dX    (4.2) 

Although theoretically it is possible to calculate equation (4.2) for a 
non-parametric classifier such as the Parzen density estimator, in practice, it is 
frequently not feasible due to a prohibitively long computing time, particularly for 
high dimensional data. 
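
By contrast, when both class densities are Gaussian the integral in (4.2) has a well-known closed form, which is what makes the Bhattacharyya distance practical in the parametric case. The sketch below implements that closed form; the function name and the sample statistics are hypothetical.

```python
import numpy as np

def bhattacharyya_gaussian(m1, S1, m2, S2):
    """Bhattacharyya distance between N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    S1, S2 = np.asarray(S1, dtype=float), np.asarray(S2, dtype=float)
    S = 0.5 * (S1 + S2)
    diff = m2 - m1
    mean_term = 0.125 * diff @ np.linalg.solve(S, diff)
    # slogdet returns (sign, log|det|); log-determinants avoid overflow
    # in high dimensional spaces.
    cov_term = 0.5 * (np.linalg.slogdet(S)[1]
                      - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))
    return float(mean_term + cov_term)

# Equal covariances: only the mean term contributes (1/8 * 2^2 = 0.5).
print(bhattacharyya_gaussian([0.0], [[1.0]], [2.0], [[1.0]]))
# Equal means: only the covariance term contributes (ln 1.25 here).
print(bhattacharyya_gaussian([0.0, 0.0], np.eye(2), [0.0, 0.0], 4.0 * np.eye(2)))
```

The two terms separate the contributions of the mean difference and of the covariance difference, so the distance remains non-zero for classes with identical means.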

Short and Fukunaga showed that, by problem localization, most pattern 
recognition problems can be solved using simple parametric forms, while a global 
parametric solution may be intractable (Fukunaga and Short 1978). Short and 
Fukunaga also proposed a feature extraction algorithm using problem 
localization (Short and Fukunaga 1982). They considered feature extraction as 
a mean-square estimation of the Bayes risk vector. The problem is simplified by 
partitioning the distribution space into local subregions and performing a linear 
estimation in each subregion. 

Though the computation cost of non-parametric classifiers is often much 
larger than that of parametric classifiers, there are some cases where the use of 
non-parametric classifiers is desirable. For instance, if underlying densities are 
unknown or problems involve complex densities which cannot be approximated 
by the common parametric density functions, use of a non-parametric classifier 
may be necessary. However, for high dimensional data and multi-source data,
the computation cost of non-parametric classifiers can be very large. As a result,
there is a greater need for a practical feature extraction algorithm which can
take full advantage of non-parametric classifiers, which can define an arbitrary
decision boundary.

In this chapter, we extend the decision boundary feature extraction 
method in Chapter 3 to non-parametric cases (Lee and Landgrebe 1991-1). 
The method is based directly on the decision boundary. Instead of utilizing 
distributions of data, we explore the decision boundary which the employed 
classifier defines. It has been shown that all feature vectors which are helpful in 
discriminating between classes can be obtained from the decision boundary 
(Lee and Landgrebe 1991-2). Thus, by extracting features directly from the
decision boundary which a non-parametric classifier defines, one can fully
exploit the advantage of the non-parametric classifier. Since the decision
boundary cannot be expressed analytically in the non-parametric case, the
proposed algorithm finds points on the decision boundary numerically. From
these points, feature vectors are extracted. The proposed algorithm predicts the 
minimum number of features to achieve the same classification accuracy as in 
the original space while at the same time finding the needed feature vectors. 


4.2 Decision Boundary Feature Extraction for Non-Parametric Classification 

4.2.1 Effective Decision Boundary in Non-Parametric Classifiers 

In Chapter 3, we defined the effective decision boundary for parametric 
classifiers as follows (see Definition 3.4): 

Definition 3.4 The effective decision boundary is defined as

$$\{ X \mid h(X) = t,\ X \in R_1 \text{ or } X \in R_2 \}$$

where $R_1$ is the smallest region which contains a certain portion, $P_{threshold}$, of
class $\omega_1$ and $R_2$ is the smallest region which contains a certain portion,
$P_{threshold}$, of class $\omega_2$.

Also, the effective decision boundary feature matrix was defined as follows:

Definition 3.7 The effective decision boundary feature matrix (EDBFM): Let
N(X) be the unit normal vector to the decision boundary at a point X on
the effective decision boundary for a given pattern classification problem.
Then the effective decision boundary feature matrix $\Sigma_{EDBFM}$ is defined as

$$\Sigma_{EDBFM} = \frac{1}{K} \int_{S'} N(X) N^t(X)\, p(X)\, dX$$

where p(X) is a probability density function, $K = \int_{S'} p(X)\, dX$, and S' is the
effective decision boundary as defined in Definition 3.4, and the integral
is performed over the effective decision boundary.

In parametric classifiers, assuming Gaussian distributions, the above
definitions give a proper meaning. However, in non-parametric classifiers, the
above definitions may not give a correct effective decision boundary when the
problem involves outliers or some special multimodal cases.


Figure 4.1 Effective decision boundary in non-parametric classifiers.
(Figure labels: Decision Boundary 1; a small portion of class $\omega_2$.)

Figure 4.1 illustrates such a problem. In Figure 4.1, Decision Boundary 1
should be the effective decision boundary to be considered in calculating the
decision boundary feature matrix. However, according to Definition 3.7,
Decision Boundary 2 will be more heavily weighted. As a result, inefficiency will
be introduced in the calculated decision boundary feature matrix.

To overcome such a problem in non-parametric classifiers, the definition
of the effective decision boundary (Definition 3.4) needs to be generalized. We
can define the effective decision boundary as the portion of the whole decision
boundary which separates most of the data in the same way as the whole
decision boundary separates. To be more precise, we generalize the definition
of the effective decision boundary as follows:

Definition 4.1 The effective decision boundary of $P_{portion}$ is defined as the
portion of the whole decision boundary which separates $P_{portion}$ of the data
in the same way as the whole decision boundary separates.

Note that Definition 4.1 holds for parametric and non-parametric classifiers
and gives a proper physical meaning. Definition 3.4 can be viewed as a
special case of Definition 4.1 assuming Gaussian distributions. With the effective
decision boundary as in Definition 4.1, the definition of the effective decision 
boundary feature matrix (Definition 3.7) will give a relevant result for non- 
parametric classifiers even when the problem involves outliers. However, as will 
be seen, it is more difficult to locate the effective decision boundary in non- 
parametric classifiers than in parametric classifiers. We will discuss this problem 
in detail later. 

4.2.2 Parzen Density Estimation and Selection of Kernel Size 

A non-parametric classifier with Parzen density estimation will be used to
test the proposed feature extraction algorithm for non-parametric classification,
so we briefly discuss Parzen density estimation. Parzen density estimation
with kernel $\varphi$ is defined as (Duda and Hart 1973)

$$p(X) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^N}\, \varphi\!\left( \frac{X - X_i}{h} \right)$$

where N is the dimensionality of the data, h is the window size, and n is the
number of training samples. The kernel $\varphi$ must be non-negative and satisfy the
following condition:

$$\int \varphi(X)\, dX = 1$$

Although many authors have studied the problem of determining the
value of the Parzen scale parameter h, no theoretical value of h gives
consistently optimum results (Fukunaga and Hummels 1987). As a result, we
determined the best h experimentally. Figure 4.2 shows the
classification results for various h. The peak performance occurs when h is
between 0.5 and 0.7 in this case.
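The Parzen estimate can be sketched in a few lines of Python. The sketch below assumes a Gaussian kernel (the kernel used in the experiments of Section 4.4); the function name `parzen_density` is hypothetical, and no attempt is made at numerical efficiency.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen estimate p(x) = (1/n) * sum_i h^-N * phi((x - x_i) / h), Gaussian phi."""
    samples = np.atleast_2d(samples)          # shape (n, N)
    n, N = samples.shape
    u = (np.asarray(x) - samples) / h         # scaled offsets to every training sample
    phi = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2.0 * np.pi) ** (N / 2.0)
    return phi.sum() / (n * h ** N)
```

With one such estimate per class, the window size h could be chosen as in Figure 4.2, by classifying held-out samples for several candidate values of h and keeping the one with the highest accuracy.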



Figure 4.2 Determining the best h experimentally. 

4.2.3 Determining the Decision Boundary and Finding Normal Vectors to the 
Decision Boundary for Non-Parametric Classifiers 

In order to extract feature vectors from the decision boundary of a given
classifier, we need to calculate the decision boundary feature matrix $\Sigma_{DBFM}$ as
given in Definition 3.6. Then Theorem 3.2 and Theorem 3.3 tell us that the
eigenvectors of $\Sigma_{DBFM}$ corresponding to non-zero eigenvalues of $\Sigma_{DBFM}$ are all
the feature vectors needed for discriminating between the classes for the given
classifier as shown in Chapter 3. In order to calculate the decision boundary
feature matrix $\Sigma_{DBFM}$, the decision boundary must be found. However, in
general, a non-parametric classifier defines an arbitrary decision boundary
which may not be expressed in analytic form. Therefore the decision boundary
for non-parametric classifiers must be calculated numerically.


In Section 3.4.3, we defined the decision boundary as follows (Definition 3.3):

$$\{ X \mid h(X) = t \}$$

where

$$h(X) = -\ln \frac{P(X|\omega_1)}{P(X|\omega_2)} \qquad (4.3)$$

$$t = \ln \frac{P(\omega_1)}{P(\omega_2)} \qquad (4.4)$$

Figure 4.3 Finding decision boundary numerically for non-parametric classifiers.

Consider the example in Figure 4.3. Assuming X and Y are classified differently,
the line connecting X and Y must pass through the decision boundary. Although,
by moving along the line, we can find a point Z at which h(Z)=t, there is no
guarantee that the point Z is exactly on the true decision boundary, even though
h(Z)=t, because h is computed from estimated densities. Figure 4.4 shows an
example. In the example, data are generated for the following statistics.

Figure 4.4 Finding decision boundary numerically.
(Legend: Class 1; Class 2; decision boundary found numerically.)

The points of the decision boundary found numerically are shown along with 
the true decision boundary plotted as a straight line in Figure 4.4. As can be 
seen, the points of the numerically found decision boundary are distributed 
along the true decision boundary. However, the points are not exactly on the 
true decision boundary. The problem that the numerically found decision 
boundary does not match exactly the true decision boundary becomes more 
apparent when training samples are limited or the Parzen scale parameter h is 
small. However, in our experiments, we found that inaccurate estimation of the 
decision boundary has relatively little impact on the performance of the decision 
boundary feature extraction method for non-parametric classifiers if the 
estimated decision boundary is in the vicinity of the true decision boundary. We 
will discuss this problem more in the experiments. 
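The search along the connecting line can be sketched as follows. The text only says that one moves along the line; bisection is used here as one simple way to do so and is an assumption of this sketch, as are the function name `boundary_point` and the tolerance parameters.

```python
import numpy as np

def boundary_point(h, x, y, t=0.0, tol=1e-6, max_iter=100):
    """Bisect the segment from x to y until a point z with |h(z) - t| < tol is found.

    h is any callable returning the discriminant value; x and y must lie on
    opposite sides of the boundary, i.e. h(x) - t and h(y) - t differ in sign.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    fx, fy = h(x) - t, h(y) - t
    if fx * fy > 0:
        raise ValueError("x and y are classified the same way")
    for _ in range(max_iter):
        z = 0.5 * (x + y)
        fz = h(z) - t
        if abs(fz) < tol:
            break
        if fx * fz <= 0:      # sign change between x and z: keep that half
            y, fy = z, fz
        else:                 # otherwise the boundary lies between z and y
            x, fx = z, fz
    return z
```

The returned point satisfies h(Z)=t only for the estimated h, so it inherits whatever error the density estimates carry, which is the effect visible in Figure 4.4.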


A normal vector to the decision boundary at X is given by

$$\nabla h(X) = \frac{\partial h}{\partial x_1} x_1 + \frac{\partial h}{\partial x_2} x_2 + \cdots + \frac{\partial h}{\partial x_n} x_n \qquad (4.5)$$

However, in non-parametric classifiers, the decision boundary cannot be
expressed analytically and equation (4.5) cannot be used. Instead, we may
estimate the normal vector as follows:

$$\nabla h(X) \approx \frac{\Delta h}{\Delta x_1} x_1 + \frac{\Delta h}{\Delta x_2} x_2 + \cdots + \frac{\Delta h}{\Delta x_n} x_n \qquad (4.6)$$


A problem with estimating a normal vector numerically is that the nearest samples
often have much influence on the estimation of normal vectors. This problem
becomes more apparent when training samples are limited or the Parzen scale 
parameter h is small. As a result, care must be taken in selecting the Parzen 
scale parameter h, particularly in a high dimensional space. We will discuss this 
problem more in the experiments. 
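Equation (4.6) amounts to a finite-difference gradient. A minimal sketch is given below; the central difference and the step size `delta` are choices made for this illustration and are not specified in the text, and `estimate_unit_normal` is a hypothetical name.

```python
import numpy as np

def estimate_unit_normal(h, x, delta=1e-3):
    """Approximate grad h at x by central differences and normalize to unit length."""
    x = np.asarray(x, float)
    grad = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = delta                       # perturb one coordinate at a time
        grad[i] = (h(x + step) - h(x - step)) / (2.0 * delta)
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        raise ValueError("gradient estimate is zero; try a larger delta")
    return grad / norm
```

When h comes from Parzen estimates with a small window, the difference quotients are dominated by the few samples nearest to x, which is the sensitivity described above.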


4.2.4 Decision Boundary Feature Extraction Procedure for Non-Parametric 
Classification 

Now we propose the following procedure to find the decision boundary
numerically and calculate the decision boundary feature matrix for non-parametric
classifiers.

Procedure for Feature Extraction for Non-Parametric Classifier
Utilizing the Decision Boundary
(2 pattern class case)

STEP 1: Classify the training data using full dimensionality.

STEP 2: For each sample correctly classified as class $\omega_1$, find the nearest
sample correctly classified as class $\omega_2$. Repeat the same procedure
for the samples correctly classified as class $\omega_2$.

STEP 3: Connect the pairs of samples found in STEP 2. Since a pair of
samples are classified differently, the line connecting the pair of
samples must pass through the decision boundary. By moving along
the line, find the point on the decision boundary or near the decision
boundary within a threshold.

STEP 4: At each point found in STEP 3, estimate the unit normal vector $N_i$ by

$$N_i = \frac{\nabla h(X)}{|\nabla h(X)|}$$

where

$$\nabla h(X) \approx \frac{\Delta h}{\Delta x_1} x_1 + \frac{\Delta h}{\Delta x_2} x_2 + \cdots + \frac{\Delta h}{\Delta x_n} x_n$$

$$h(X) = -\ln \frac{P(X|\omega_1)}{P(X|\omega_2)}$$

assuming Bayes' decision rule for minimum error is used.

STEP 5: Estimate the decision boundary feature matrix using the normal
vectors found in STEP 4.

$$\Sigma_{EDBFM} = \frac{1}{L} \sum_{i=1}^{L} N_i N_i^t$$

where L is the number of points found on the decision boundary.

STEP 6: Select the eigenvectors of the decision boundary feature matrix as
new feature vectors according to the magnitude of corresponding
eigenvalues.

Euclidean distance is used to find the nearest sample in STEP 2 in our
experiments. Figure 4.5 shows an illustration of the proposed procedure.
Although the proposed procedure does not find the decision boundary where
data are sparsely distributed, this is an advantage, not a disadvantage, of the
procedure. By concentrating on the decision boundary where most of the data are
distributed, the feature extraction can be more efficient, as shown in Chapter 3.
The increase in classification error resulting from not considering the decision
boundary in the region where data are sparsely distributed will be minimal
since there will be very little data in that region.

Figure 4.5 Illustration of the feature extraction procedure for a non-parametric
classifier utilizing the decision boundary.
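STEPS 1 to 6 of the two-class procedure can be put together in a compact Python sketch. It is only an illustration under several assumptions that the text does not fix: a Gaussian-kernel Parzen classifier, bisection for the line search in STEP 3, central differences in STEP 4, and default values for the window size `h`, the threshold `t`, and the tolerances. All function and parameter names are hypothetical.

```python
import numpy as np

def parzen_log_density(x, samples, h):
    """Log of the Gaussian-kernel Parzen estimate at x (log-sum-exp for stability)."""
    n, N = samples.shape
    e = -0.5 * np.sum(((x - samples) / h) ** 2, axis=1)
    m = e.max()
    return (m + np.log(np.exp(e - m).sum())
            - np.log(n) - N * np.log(h) - 0.5 * N * np.log(2.0 * np.pi))

def dbfe_two_class(X1, X2, h=0.6, t=0.0, delta=1e-3, tol=1e-4):
    # h_fn(X) = -ln p(X|w1)/p(X|w2); X is assigned to class w1 when h_fn(X) < t
    h_fn = lambda x: parzen_log_density(x, X2, h) - parzen_log_density(x, X1, h)

    # STEP 1: keep only the correctly classified training samples
    C1 = np.array([x for x in X1 if h_fn(x) < t])
    C2 = np.array([x for x in X2 if h_fn(x) >= t])

    # STEP 2: pair each sample with its nearest (Euclidean) sample of the other class
    pairs = [(a, C2[np.argmin(np.linalg.norm(C2 - a, axis=1))]) for a in C1]
    pairs += [(C1[np.argmin(np.linalg.norm(C1 - b, axis=1))], b) for b in C2]

    normals = []
    for a, b in pairs:
        # STEP 3: bisect the segment a-b until h_fn is within tol of t
        lo, hi = a.copy(), b.copy()           # invariant: h_fn(lo) < t <= h_fn(hi)
        for _ in range(60):
            z = 0.5 * (lo + hi)
            hz = h_fn(z)
            if abs(hz - t) < tol:
                break
            if hz < t:
                lo = z
            else:
                hi = z
        # STEP 4: finite-difference estimate of the unit normal at z
        grad = np.array([(h_fn(z + delta * e) - h_fn(z - delta * e)) / (2.0 * delta)
                         for e in np.eye(z.size)])
        norm = np.linalg.norm(grad)
        if norm > 0:
            normals.append(grad / norm)

    # STEP 5: feature matrix as the average of the outer products N_i N_i^t
    Nmat = np.array(normals)
    edbfm = Nmat.T @ Nmat / len(Nmat)

    # STEP 6: eigenvectors ordered by decreasing eigenvalue
    evals, evecs = np.linalg.eigh(edbfm)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]
```

Since every $N_i$ has unit length, the eigenvalues returned by this sketch sum to one, so each eigenvalue can be read directly as the share of boundary normals explained by its eigenvector.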


4.2.5 Outlier Problem 

If the two classes are very separable and one class has some outliers as
shown in Figure 4.6, the proposed procedure will calculate the decision
boundary feature matrix with more weight on the decision boundary between
the outliers of class $\omega_2$ and class $\omega_1$ (Decision Boundary 2) than the decision
boundary between the main portion of class $\omega_2$ and class $\omega_1$ (Decision
Boundary 1), assuming the outliers are correctly classified. Although such a
case as in Figure 4.6 will not occur frequently in real applications, such outliers
will make the proposed procedure less efficient since Decision Boundary 1 is
the effective decision boundary in that case. However, it is not a fundamental
problem of the decision boundary feature extraction algorithm, but a procedural
problem of how to find the effective decision boundary (Definition 4.1). In
parametric classifiers, such outliers could be eliminated using the chi-square
threshold test. In non-parametric classifiers, it is more difficult to eliminate such
outliers.

Figure 4.6 Outlier problem.
(Figure labels: Decision Boundary 1; outliers of class $\omega_2$.)


To overcome the outlier problem, STEP 2 in the proposed procedure can 
be modified as follows: 

STEP 2a: For each sample correctly classified as class $\omega_1$, select randomly a
sample correctly classified as class $\omega_2$. Repeat the same procedure
for the samples correctly classified as class $\omega_2$.

By randomly selecting a sample classified as the other class, the decision 
boundary which separates the main portion of classes will be weighted more 
heavily, and inefficiency caused by outliers, when classes are very separable, 
will be eliminated. 

However, if data are distributed as shown in Figure 4.7, STEP 2a will 
cause the inclusion of some ineffective decision boundary in calculating the 
decision boundary feature matrix, while STEP 2 can concentrate on the 
effective decision boundary. Thus, in the case of Figure 4.7, which is a more 
typical case in real data, STEP 2 will be more efficient than STEP 2a. The 
problem can be summarized as how to find the effective decision boundary
even when outliers exist. To this end, STEP 2 in the proposed procedure can be
modified as follows:

STEP 2b: For each sample correctly classified as class $\omega_1$, find the L nearest
samples correctly classified as class $\omega_2$. From the L nearest samples,
select a sample at random. Repeat the same procedure for the
samples correctly classified as class $\omega_2$.

By increasing L, one can eliminate the outlier problem. By decreasing L, one 
can concentrate on the effective decision boundary. Thus, there will be a 
tradeoff between eliminating the outlier problem and concentrating on the 
effective decision boundary. As pointed out previously, the problem is how to 
find the effective decision boundary. If one can exactly locate the effective 
decision boundary, the decision boundary feature extraction algorithm will be 
more effective. 
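The partner selection of STEP 2b can be sketched as below. The function name `pick_partner` and the use of a NumPy random generator are assumptions of this illustration; Euclidean distance is used, as in the experiments.

```python
import numpy as np

def pick_partner(sample, other_class, L, rng):
    """STEP 2b: choose at random one of the L nearest samples of the other class."""
    dist = np.linalg.norm(other_class - sample, axis=1)
    L = min(L, len(other_class))              # cannot draw from more samples than exist
    nearest = np.argsort(dist)[:L]
    return other_class[rng.choice(nearest)]
```

Setting L to 1 reproduces STEP 2 (always the nearest sample), while setting L to the size of the other class reproduces STEP 2a (a uniformly random sample), so L interpolates between the two behaviors.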



Figure 4.7 A more typical data distribution and its decision boundary. 


4.2.6 Non-Parametric Classifiers Not Defining Probability Densities 

Some non-parametric classifiers such as the kNN classifier do not define
class probability densities. If the employed non-parametric classifier does not
define class probability densities, h(X) in equation (4.3) cannot be calculated
and normal vectors cannot be estimated. In that case, one might find
a vector along which the classification result changes most rapidly. For
example, let X be a point on the decision boundary. Then find the smallest $\Delta x_i$
such that the classification result of $X + \Delta x_i x_i$ is different from that of X. We may
then estimate a unit vector N along which the classification result changes most
rapidly as follows:

$$N = \frac{V}{|V|} \quad \text{where} \quad V = \frac{1}{\Delta x_1} x_1 + \frac{1}{\Delta x_2} x_2 + \cdots + \frac{1}{\Delta x_n} x_n$$
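A sketch of this estimate for a classifier that only returns labels is given below. The text does not say how the smallest $\Delta x_i$ is searched for; here each coordinate axis is scanned in fixed increments in both directions and the signed displacement is kept, which is an assumption of this illustration, as are the names `change_direction`, `step`, and `max_steps`.

```python
import numpy as np

def change_direction(classify, x, step=0.01, max_steps=1000):
    """Estimate the unit vector along which the label of `classify` changes fastest."""
    x = np.asarray(x, float)
    base = classify(x)
    v = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0
        for k in range(1, max_steps + 1):
            d = k * step
            if classify(x + d * e) != base:   # label flips in the positive direction
                v[i] = 1.0 / d
                break
            if classify(x - d * e) != base:   # label flips in the negative direction
                v[i] = -1.0 / d
                break
        # an axis with no label change within the search range keeps component 0
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("no label change found within the search range")
    return v / norm
```

For a linear boundary $w^t X = c$ the displacement needed along axis i is $(c - w^t X)/w_i$, so the components $1/\Delta x_i$ are proportional to $w_i$ and the estimate points along the true normal, up to the resolution of the search step.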

4.2.7 Multiclass Cases 

If there are more than two classes, the procedure can be repeated for
each pair of classes, and the total effective decision boundary feature matrix
can be calculated by averaging the effective decision boundary feature matrices
which are calculated for each pair of classes. If prior probabilities are available,
the summation can be weighted. In other words, if there are M classes, the
decision boundary feature matrix can be calculated as

$$\Sigma_{DBFM} = \sum_{i}^{M} \sum_{j,\, j \neq i}^{M} P(\omega_i) P(\omega_j)\, \Sigma_{DBFM}^{ij}$$

where $\Sigma_{DBFM}^{ij}$ is the decision boundary feature matrix between
class $\omega_i$ and class $\omega_j$ and $P(\omega_i)$ is the prior probability of class $\omega_i$ if
available. Otherwise let $P(\omega_i) = 1/M$.
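The weighted combination over class pairs can be sketched as follows. The dictionary layout (one matrix per unordered class pair) and the name `multiclass_dbfm` are assumptions of this illustration.

```python
import numpy as np
from itertools import permutations

def multiclass_dbfm(pair_matrices, priors=None):
    """Sum P(w_i) P(w_j) * Sigma_ij over all ordered pairs i != j.

    pair_matrices maps an unordered class pair, stored as a tuple (i, j),
    to the feature matrix computed for that pair.
    """
    classes = sorted({c for pair in pair_matrices for c in pair})
    M = len(classes)
    if priors is None:
        priors = {c: 1.0 / M for c in classes}   # equal priors when none are given
    d = next(iter(pair_matrices.values())).shape[0]
    total = np.zeros((d, d))
    for i, j in permutations(classes, 2):
        key = (i, j) if (i, j) in pair_matrices else (j, i)
        total += priors[i] * priors[j] * pair_matrices[key]
    return total
```

Each unordered pair appears twice in the double sum, which only rescales the matrix and therefore leaves its eigenvectors and the ordering of its eigenvalues unchanged.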


4.3 Decision Boundary Feature Extraction and Problem Localization 

By problem localization, Short and Fukunaga showed that most pattern 
recognition problems can be solved using simple parametric forms (Fukunaga 
and Short 1978). In (Short and Fukunaga 1982), Short and Fukunaga proposed 
a feature extraction method using problem localization. In their method, the 
original space is subdivided into a number of subregions and a linear 
estimation is performed in each subregion. A modified clustering algorithm is 
used to find the subregions. To a certain extent, the decision boundary feature 
extraction method parallels the problem localization approach. In problem 
localization, Short and Fukunaga recognized that a parametric discriminant 
function can be used in each subregion (Fukunaga and Short 1978). In the 
decision boundary feature extraction method, we recognized that only a small 
portion of the decision boundary plays a significant role in discriminating 
between classes.

Figure 4.8 Decision boundary and effective decision boundary.

Figure 4.9 Effective decision boundary and new decision boundary extended
by the effective decision boundary.

Consider the case of Figure 4.8. The effective decision boundary, which is
plotted in bold, plays a significant role in discriminating between classes. Even
if only the effective decision boundary is used, the data can still be classified in
almost the same manner as when the whole decision boundary is used, as
shown in Figure 4.9. On the other hand, the parts of the decision boundary which
are plotted as plain lines play relatively little role in discriminating between
classes, while some parts of the decision boundary, plotted as dotted lines, are
rarely used. Therefore, we recognized that by concentrating on the effective
decision boundary, the feature extraction can be more efficient. It is noted that
the effective decision boundary need not be linear or be represented by a
parametric form.

However, the decision boundary feature extraction method differs from 
the problem localization in several ways. First, the decision boundary feature 
extraction method does not divide the pattern space into subregions. Dividing 
the pattern space into subregions is not an easy task when the number of 
subregions is unknown. This problem becomes apparent particularly in a 
multiclass problem with real, high dimensional data. Secondly, the decision 
boundary feature extraction method finds a global feature set while a local 
feature set is found in the problem localization. Thirdly, in the problem
localization, Short and Fukunaga take advantage of the fact that class
boundaries are likely to be more nearly linear in each subregion, while the
decision boundary feature extraction method does not assume that the effective
decision boundary is nearly linear or can be represented in a parametric form.
In the decision boundary feature extraction method, the effective decision 
boundary can be of any shape. Finally the decision boundary feature extraction 
method has the capability to predict the minimum number of features needed to 
achieve the same classification accuracy as in the original space. 

4.4 Experiment and Result 

4.4.1 Experiments with generated data 

In order to evaluate closely how the proposed algorithm performs under 
various circumstances, tests are conducted on generated data with given 
statistics. The non-parametric classifier was implemented by Parzen density 
estimation using a Gaussian kernel function (Silverman 1986). In each 
example, classification accuracies of the decision boundary feature extraction 
method and the discriminant analysis using equation (4.1) as a criterion 
function are compared. We will refer to the decision boundary feature extraction
method as Decision Boundary Feature Extraction, and to discriminant analysis
using equation (4.1) as Discriminant Analysis.

Example 4.1 In this example, class $\omega_1$ is normal with the following statistics:

Class $\omega_2$ is equally divided between two normal distributions with the following
statistics:



200 samples are generated for each class. Figure 4.10 shows the distribution of
the data along with the decision boundary found numerically by the proposed
procedure. Eigenvalues $\lambda_i$ and eigenvectors $\phi_i$ of $\Sigma_{EDBFM}$ are calculated as
follows:

$$\lambda_1 = 0.98338, \quad \lambda_2 = 0.01662$$

$$\phi_1 = \begin{bmatrix} 0.69 \\ \ldots \end{bmatrix}, \quad \phi_2 = \begin{bmatrix} 0.72 \\ 0.69 \end{bmatrix}$$

Since one eigenvalue is significantly larger than the other, it can be said that
the rank of $\Sigma_{EDBFM}$ is 1. That means only one feature is needed to achieve the
same classification accuracy as in the original space. Considering the statistics
of the two classes, the rank of $\Sigma_{EDBFM}$ gives the correct number of features
needed to achieve the same classification accuracy as in the original space.
Table 4.1 shows the classification accuracies of Decision Boundary Feature 
Extraction and Discriminant Analysis. Decision Boundary Feature Extraction 
finds the right features achieving the same classification accuracy with one 
feature while Discriminant Analysis performs significantly less well in this 
example since class means are the same. 

Table 4.1 Classification accuracies of Decision Boundary Feature Extraction 
and Discriminant Analysis in Example 4.1.

Figure 4.10 Data distribution of Example 4.1. The decision boundary found by
the proposed procedure is also shown.
(Legend: decision boundary found numerically.)

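Example 4.2 In this example, class $\omega_1$ is normal with the following statistics:

$$\Sigma_1 = \begin{bmatrix} 9 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 9 \end{bmatrix}$$

Class $\omega_2$ is equally divided between two normal distributions with the following
statistics:

$$\Sigma_2^1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 9 \end{bmatrix} \quad \text{and} \quad M_2^2 = \begin{bmatrix} -3 \\ 0 \\ 0.1 \end{bmatrix}, \quad \Sigma_2^2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 9 \end{bmatrix}$$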

200 samples are generated for each class. From the statistics, it can be seen 
that the decision boundary approximately consists of two cylindrical surfaces. 
Figure 4.11 shows the distribution of the data in the x1-x2 plane. The decision 
boundary found by the proposed procedure numerically is also shown. 
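Eigenvalues $\lambda_i$ and eigenvectors $\phi_i$ of $\Sigma_{EDBFM}$ are calculated as follows:

$$\lambda_1 = 0.58993, \quad \lambda_2 = 0.33985, \quad \lambda_3 = 0.07021$$

$$\phi_1 = \begin{bmatrix} 0.25 \\ -0.97 \\ 0.00 \end{bmatrix}, \quad \phi_2 = \begin{bmatrix} 0.95 \\ 0.24 \\ 0.17 \end{bmatrix}, \quad \phi_3 = \begin{bmatrix} -0.17 \\ -0.04 \\ 0.98 \end{bmatrix}$$

$$\operatorname{Rank}(\Sigma_{EDBFM}) \approx 2$$

It can be said that the rank of $\Sigma_{EDBFM}$ is approximately 2. Thus two features are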
needed to achieve the same classification accuracy as in the original space, 
which agrees with the data. Table 4.2 shows the classification accuracies of 
Decision Boundary Feature Extraction and Discriminant Analysis. Decision 
Boundary Feature Extraction finds the correct features achieving about the 
same classification accuracy with two features while Discriminant Analysis 
performs significantly less well, since there is no class mean difference. 



Figure 4.11 Data distribution of Example 4.2. The decision boundary found by
the proposed procedure is also shown.
(Legend: Class 1; Class 2; decision boundary found by the procedure.)

Table 4.2 Classification accuracies of Decision Boundary Feature Extraction
and Discriminant Analysis in Example 4.2.

Number of     Discriminant     Decision Boundary
Features      Analysis         Feature Extraction
1             61.5 (%)         68.8 (%)
2             67.8 (%)         76.3 (%)
3             76.0 (%)         76.3 (%)


4.4.2 Experiments with real data 

Real data sets were selected from a high dimensional multispectral 
remote sensing data base of agricultural areas. The data were collected by the 
Field Spectrometer System (FSS), a helicopter-mounted field spectrometer, as
a part of the LACIE program (Biehl et al. 1982). Table 4.3 shows the major
parameters of FSS. 


Table 4.3 Parameters of Field Spectrometer System (FSS).

Number of Bands       60
Spectral Coverage     0.4 - 2.4 μm
Altitude              60 m
IFOV (ground)         25 m


Along with the proposed algorithm, three other feature extraction 
algorithms, Uniform Feature Design, the Karhunen-Loeve transformation 
(Principal Component Analysis) (Duda and Hart 1973), and the discriminant 
analysis using equation (4.1) as a criterion function (Fukunaga 1990) are tested 
to evaluate and compare the performance of the proposed algorithm. Uniform 
Feature Design is a simple band combination procedure. For example, if the 
number of features is to be reduced to 30, every two consecutive bands are 
combined to form a new feature. Where 60 is not evenly divisible by the
desired number of features, the nearest integer number of bands is used. For
example, for 9 features, the first 6 original bands were combined to create the
first feature, then the next 7 bands were combined to create the next feature,
and so on. Uniform Feature Design is used as a baseline to evaluate the
efficiency of the other feature extraction methods. The discriminant analysis
using equation (4.1) is referred to as Discriminant Analysis.
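Uniform Feature Design can be sketched as a simple band-grouping routine. Two details are assumptions of this illustration: consecutive bands are combined by averaging (the text does not say whether bands are summed or averaged), and group boundaries are placed at the integer part of k·60/n, a rule chosen because it reproduces the 6-then-7 grouping quoted above for 9 features. The function name is hypothetical.

```python
import numpy as np

def uniform_feature_design(X, n_features):
    """Average consecutive bands of X (samples x bands) into n_features groups."""
    n_bands = X.shape[1]
    # integer group boundaries, e.g. 0, 6, 13, 20, ... for 60 bands and 9 features
    bounds = [(k * n_bands) // n_features for k in range(n_features + 1)]
    return np.column_stack([X[:, a:b].mean(axis=1)
                            for a, b in zip(bounds[:-1], bounds[1:])])
```

For 30 features the same rule yields pairs of consecutive bands, matching the first example in the text.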
In the first test, 4 classes are chosen from the FSS data. Table 4.4
provides information on the 4 classes. Figure 4.12 shows the mean graph of the
4 classes. As can be seen, there are reasonable mean differences among the
classes. In this test, 400 randomly selected samples are used for training and
the rest are used for testing.


Table 4.4 Class description.

Species           Date              No. of Samples
...               ...               657
Unknown Crops     May 3, 1977       678
Winter Wheat      March 8, 1977     691
Unknown Crops     March 8, 1977     619

Figure 4.12 Mean graph of the 4 classes of Table 4.4.
(Horizontal axis: Spectral Band.)


Figure 4.13 shows a performance comparison. First the original 60 dimensional
data are reduced to 17 dimensional data using Uniform Feature Design. Then
Decision Boundary Feature Extraction, Discriminant Analysis, and Principal
Component Analysis are applied to the 17 dimensional data. With 17 features,
the classification accuracy is about 90.0%. In low dimensions (number of
features ≤ 3), Discriminant Analysis performs better than the other methods.
When more than 3 features are used, Decision Boundary Feature Extraction
starts to perform better than the other methods.

Figure 4.13 Performance comparison of Uniform Feature Design, Decision
Boundary Feature Extraction, Discriminant Analysis, and
Principal Component Analysis.


In the next test, there are 3 classes and each class has 2 subclasses. In 
other words, 2 subclasses were combined to form a new class. By purposely 
combining data from different classes, the data are made to be multi-modal. 
Table 4.5 provides information on the classes. Figure 4.14 shows a mean value 
graph of the 6 subclasses, and Figure 4.15 shows a mean value graph of the 3 
classes, each of which has 2 subclasses. 500 randomly selected samples from
each class are used as training data and the rest are used for testing.

Table 4.5 Class description.

Class              Subclass                         No. of Samples    Total No. of Samples
Class $\omega_1$   Winter Wheat, March 8, 1977      691               1209
                   Spring Wheat, July 26, 1978      518
Class $\omega_2$   Winter Wheat, June 26, 1977      677               1146
                   Spring Wheat, Sep. 21, 1978      469
Class $\omega_3$   Winter Wheat, Oct. 18, 1977      662               1103
                   Spring Wheat, Oct. 26, 1978      441


Figure 4.14 Mean graph of the 6 sub-classes of Table 4.5.
(Legend: Winter Wheat, March 1977; Spring Wheat, July 1978; Winter Wheat, June 1977;
Spring Wheat, Sep. 1978; Winter Wheat, Oct. 1977; Spring Wheat, Oct. 1978.)

Figure 4.15 Mean graph of the 3 classes of Table 4.5.

Figure 4.16 Performance comparison of Uniform Feature Design, Decision
Boundary Feature Extraction, Discriminant Analysis, and
Principal Component Analysis of the data of Table 4.5 (test data).


Figure 4.16 shows a performance comparison. With 17 features, the
classification accuracy is about 89%. Discriminant Analysis shows the best
performance until 3 features are used. However, in that range the classification
accuracies are much lower than the maximum possible classification accuracy, so
the comparison is of little practical relevance. Decision Boundary Feature Extraction
shows consistently better performance when more than 3 features are used.
Decision Boundary Feature Extraction achieves about 89% classification
accuracy with 7 features while all other methods need 13-17 features to
achieve about the same classification accuracy.

In the following test, there are 3 classes and each class has 2 
subclasses. In other words, 2 subclasses were combined to form a new class. 
By purposely combining data from different classes, the data are made to be 
multi-modal. Table 4.6 provides information on the classes. 500 randomly
selected samples from each class are used as training data and the rest are
used for testing.

Table 4.6 Class description.

Class              Subclass                         No. of Samples    Total No. of Samples
Class $\omega_1$   Winter Wheat, May 3, 1977        658               1340
                   Unknown Crops, May 3, 1977       682
Class $\omega_2$   Winter Wheat, March 8, 1977      691               1310
                   Unknown Crops, March 8, 1977     619
Class $\omega_3$   Winter Wheat, June 26, 1977      677               1320
                   Summer Fallow, June 26, 1977     643

Figures 4.17-18 show the performance comparison. First the original 60
dimensional data were reduced to 17 dimensional data using Uniform Feature
Design. Then Decision Boundary Feature Extraction, Discriminant Analysis, and
Principal Component Analysis were applied to the 17 dimensional data. With
the 17 features, the classification accuracies of training data and test data are
96.5% and 95.7%, respectively. In low dimensionality (N ≤ 2), Discriminant
Analysis shows the best performance, though the difference between
Discriminant Analysis and the decision boundary feature extraction method is
small. However, when more than 2 features are used, the decision boundary
feature extraction method outperforms all other methods.

Figure 4.17 Performance comparison (train data).
(Horizontal axis: Number of Features.)

Figure 4.18 Performance comparison (test data).


In the following test, there are 3 classes and each class has 2
subclasses. In other words, 2 subclasses were combined to form a new class.
By purposely combining data from different classes, the data are made to be
multi-modal. Table 4.7 provides information on the classes. 500 randomly
selected samples from each class are used as training data and the rest are
used for testing.


Table 4.7 Class description.

Class              Subclass                         No. of Samples    Total No. of Samples
Class $\omega_1$   Spring Wheat, July 9, 1978       454               972
                   Spring Wheat, July 26, 1978      518
Class $\omega_2$   Winter Wheat, June 26, 1977      677               1339
                   Winter Wheat, Oct. 18, 1977      662
Class $\omega_3$   Spring Wheat, Oct. 26, 1978      441               910
                   Spring Wheat, Sep. 21, 1978      469


Figures 4.19-20 show the performance comparison. First, the original 60 
dimensional data were reduced to 17 dimensional data using Uniform Feature 
Design. Next Decision Boundary Feature Extraction, Discriminant Analysis, and 
Principal Component Analysis were applied to the 17 dimensional data. With 
the 17 features, the classification accuracies of training data and test data are 
99.5% and 96.9%, respectively. In low dimensionality (N ≤ 2), Discriminant 
Analysis performs best, though the difference between Discriminant Analysis 
and the decision boundary feature extraction method is small. However, when 
more than 2 features are used, the decision boundary feature extraction 
method outperforms all other methods. With 5 features, the decision boundary 
feature extraction method achieves about 96.4% classification accuracy for 
test data, while Principal Component Analysis, Discriminant Analysis, and 
Uniform Feature Design achieve about 90.5%, 92.2%, and 87.9%, respectively. 

It can be said that when class mean differences are reasonably large and 
classes are uni-modal, Discriminant Analysis finds a good feature set. However, 
when classes are multi-modal, Discriminant Analysis often does not find a good 
feature set. On the other hand, Decision Boundary Feature Extraction finds a 
good feature set even when classes are multi-modal. 

Figure 4.19 Performance comparison (train data). 



Figure 4.20 Performance comparison (test data). 

4.4.3 Eigenvalues of Decision Boundary Feature Matrix and Classification 
Accuracy 

Table 4.8 lists the eigenvalues of the decision boundary feature matrix of 
the 17 dimensional data, along with proportions and accumulations. It also 
shows classification accuracies and normalized classification accuracies, 
obtained by dividing each classification accuracy by the classification accuracy 
obtained using the whole feature set. 

The rank of the decision boundary feature matrix (Σ_DBFM) must be 
decided upon, and in this case somewhat arbitrarily. Theoretically, the 
classification result obtained using all the eigenvectors of the decision 
boundary feature matrix corresponding to non-zero eigenvalues is the same 
as the classification result obtained using the whole feature set. However, for 
real data, eigenvalues of the decision boundary feature matrix are seldom zero, 
even though some eigenvalues are very close to zero, and there are large 
differences among the eigenvalues. As a result, although it is relatively easy to 
decide the rank of the decision boundary feature matrix for low dimensional 
generated data, it becomes less obvious for high dimensional real data. In 
non-parametric classification, it is more difficult still, since the decision 
boundary and normal vectors are estimated. One may add eigenvalues until the 
accumulation exceeds 95% of the total sum and set that number of 
eigenvalues as the rank of Σ_DBFM. Defined in this way, the rank of Σ_DBFM 
would be 9. Alternatively, one may retain the eigenvalues greater than one 
tenth of the largest eigenvalue. In this way, the rank of Σ_DBFM would be 6. As 
can be seen from Table 4.8, the normalized classification accuracy increases 
monotonically as the accumulation of eigenvalues increases up to 5 features. 
After 5 features, the classification accuracy is almost saturated and adding more 
features does not improve classification accuracy. Figure 4.21 shows the 
relationship between the accumulations of eigenvalues and the normalized 
classification accuracies. More experiments are needed to obtain a better 
understanding of the relationship between the normalized classification 
accuracy and the accumulation of eigenvalues. 
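
The two rank criteria described above can be checked against the eigenvalues reported in Table 4.8. The following Python sketch is only illustrative: the function names `rank_by_accumulation` and `rank_by_ratio` are hypothetical and do not come from the original work, and the eigenvalues are simply copied from the table.

```python
import numpy as np

# Eigenvalues of the decision boundary feature matrix, copied from Table 4.8.
EIGENVALUES = np.array([995.2, 556.4, 446.4, 293.3, 138.5, 120.5, 88.6, 55.8,
                        50.8, 46.2, 34.0, 21.4, 14.1, 11.3, 5.8, 4.5, 2.3])


def rank_by_accumulation(eigenvalues, fraction=0.95):
    """Smallest number of leading eigenvalues whose sum exceeds `fraction` of the total."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    accumulation = np.cumsum(ev) / ev.sum()
    # argmax on a boolean array returns the index of the first True entry.
    return int(np.argmax(accumulation > fraction)) + 1


def rank_by_ratio(eigenvalues, ratio=0.1):
    """Number of eigenvalues greater than `ratio` times the largest eigenvalue."""
    ev = np.asarray(eigenvalues, dtype=float)
    return int(np.sum(ev > ratio * ev.max()))


print(rank_by_accumulation(EIGENVALUES))  # 9
print(rank_by_ratio(EIGENVALUES))         # 6
```

Both criteria reproduce the ranks quoted in the text (9 and 6), which also illustrates how strongly the chosen rank depends on the criterion.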
Table 4.8 Relationship between eigenvalues of the decision boundary 
feature matrix and classification accuracy. 
(Ev: Eigenvalues, Pro: Proportion, Accu: Accumulation, Cl. Ac: 
Classification Accuracy, N. Cl. Ac: Normalized Classification 
Accuracy) 

No.   Ev      Pro (%)   Accu (%)   Cl. Ac (%)   N. Cl. Ac (%)
 1    995.2   34.5       34.5      57.1          63.2
 2    556.4   19.3       53.8      84.7          93.7
 3    446.4   15.5       69.3      87.9          97.2
 4    293.3   10.2       79.4      88.5          97.9
 5    138.5    4.8       84.2      89.8          99.3
 6    120.5    4.2       88.4      89.8          99.3
 7     88.6    3.1       91.5      90.1          99.7
 8     55.8    1.9       93.4      90.1          99.7
 9     50.8    1.8       95.2      90.5         100.1
10     46.2    1.6       96.8      90.2          99.8
11     34.0    1.2       97.9      90.1          99.7
12     21.4    0.7       98.7      90.2          99.8
13     14.1    0.5       99.2      90.3          99.9
14     11.3    0.4       99.6      90.4         100.0
15      5.8    0.2       99.8      90.4         100.0
16      4.5    0.2       99.9      90.4         100.0
17      2.3    0.1      100.0      90.4         100.0

Figure 4.21 Relationship between Accumulations of Eigenvalues and 
Normalized Classification Accuracies. 

4.5 Estimation of the Decision Boundary and Normal Vector 

Since non-parametric classifiers do not define the decision boundary in 
analytic form, it must be estimated numerically. Then, from the estimated 
decision boundary, normal vectors are estimated as follows: 

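$$\nabla h(X) \approx \frac{\Delta h}{\Delta x_1}\mathbf{x}_1 + \frac{\Delta h}{\Delta x_2}\mathbf{x}_2 + \cdots + \frac{\Delta h}{\Delta x_n}\mathbf{x}_n$$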

Next we will investigate the effect of inaccurate estimation of the decision 
boundary and normal vectors on the performance of the proposed decision 
boundary feature extraction. 

4.5.1 Effect of Inaccurate Estimation of the Decision Boundary 

In the proposed procedure, we find a point on the decision boundary 
by moving along the line connecting two differently classified samples. In other 
words, by moving along the line, we try to find a point X such that 

h(X) = t 

The search stops when the difference between the decision boundary and the 
estimated decision boundary is smaller than a threshold. In other words, if 

$$(h(X) - t)(h(X') - t) < 0 \quad \text{and} \quad |X - X'| < \varepsilon$$

we take either X or X' as a point on the decision boundary. To investigate the 
sensitivity of the decision boundary feature extraction method, it was applied to 
the 17 dimensional data with various thresholds, ε = 0.01σ, 0.05σ, 0.1σ, 0.5σ, 1σ, 
and 2σ, where σ is the average standard deviation, i.e., 

$$\sigma = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\sigma_j^i$$

where N is the number of features, M is the number of classes, and σ_j^i is the 
standard deviation of the j-th feature of class ω_i. 

With 17 features, the classification accuracy is 90.4%. Figure 4.22 shows the 
performance comparison for the first 5 features. For 1 feature, there is not much 
difference. For 2 features, the classification accuracy decreases as the 
threshold increases. If more than 2 features are considered, the performances 
are essentially the same. When 3 features are used, all thresholds achieve 
about 89% classification accuracy. From the experiments, it appears that a 
threshold between 0.05σ and 0.5σ would be reasonable, and the performance 
of the decision boundary feature extraction method does not appear to be very 
sensitive to inaccurate estimation of the decision boundary if the estimated 
decision boundary is in the vicinity of the true decision boundary. Furthermore, 
there is no guarantee that a smaller threshold always results in a more accurate 
estimation of the decision boundary (section 4.2.3). 
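
The search and the threshold described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the experiments: the text only says that one moves along the line, so the bisection strategy is an assumption, and the names `average_std` and `find_boundary_point` are hypothetical.

```python
import numpy as np


def average_std(class_samples):
    """Average of the per-class, per-feature standard deviations (sigma in the text).

    `class_samples` is a list with one (n_samples, N) array per class.
    """
    return float(np.mean([np.std(s, axis=0, ddof=1) for s in class_samples]))


def find_boundary_point(h, x_a, x_b, t=0.0, eps=1e-3):
    """Search the segment x_a -> x_b for a point with h(X) = t.

    Assumes (h(x_a) - t) and (h(x_b) - t) have opposite signs. Bisection is an
    assumed search strategy. The loop stops once the two bracketing points are
    closer than eps, and one of them is returned.
    """
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    f_a = h(x_a) - t
    f_b = h(x_b) - t
    if f_a * f_b >= 0:
        raise ValueError("the two samples must lie on opposite sides of the boundary")
    while np.linalg.norm(x_a - x_b) >= eps:
        mid = 0.5 * (x_a + x_b)
        f_mid = h(mid) - t
        if f_mid == 0:
            return mid
        if f_a * f_mid < 0:
            x_b = mid                 # boundary lies between x_a and mid
        else:
            x_a, f_a = mid, f_mid     # boundary lies between mid and x_b
    return x_a
```

In use, `eps` would be set to a multiple of `average_std(...)`, for example `0.1 * average_std([class1, class2])`, mirroring the thresholds compared in Figure 4.22.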



Figure 4.22 Effect of inaccurate estimation of decision boundary on the 
performance of the decision boundary feature extraction method 
(thresholds ε = 0.01σ, 0.05σ, 0.1σ, 0.5σ, 1σ, and 2σ). 


4.5.2 Effect of the Parzen Scale Parameter h in Estimating Normal Vectors 


Since normal vectors are estimated using equation (4.6), the Parzen 
scale parameter h will affect the estimation of normal vectors. Since normal 
vectors are used to estimate the decision boundary feature matrix, the Parzen 
scale parameter will affect the performance of the decision boundary feature 
extraction method. In the following test, we estimate the normal vectors using 
various Parzen scale parameters and investigate the effect of the Parzen scale 
parameter on the performance of the decision boundary feature extraction 
method. The decision boundary feature extraction method is applied to 18 
dimensional data. With 18 features the classification accuracy is 92.9%. Figure 
4.23 shows the performance comparison for various Parzen scale parameters 
in estimating normal vectors. When h = 0.3, 0.5, 0.7, and 1.0, the classification 
accuracies with 3 features are 92.6%, 92.3%, 92.2%, and 92.1%, respectively. 
As larger Parzen scale parameters are used (h > 2), classification accuracies 
decrease, though the rate of decrease is relatively small. However, if the Parzen 
scale parameter is too small (h = 0.1), the classification accuracy decreases 
considerably. Overall, Parzen scale parameters between 0.5 and 1.0 give the 
best results in this case. Although the performance of the decision boundary 
feature extraction method does not seem to be very sensitive to the variation of 
the Parzen scale parameter, care must be taken that the Parzen scale 
parameter is neither too small nor too large for the given data. 



Figure 4.23 Performance comparison for various Parzen scale parameters in 
estimating normal vectors (h = 0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 3.0, 4.0, 
and 5.0). 

4.6 Conclusion 

Decision Boundary Feature Extraction is a new feature extraction 
technique which is derived from the fact that all the feature vectors needed in 
discriminating between classes for a given classifier can be obtained from the 
decision boundary defined by the given classifier. Instead of utilizing class 
mean differences or class covariance differences, the method utilizes the 
decision boundary directly. As a result, the method does not deteriorate under 
the circumstances of equal means or equal covariances, and can be used for 
both parametric and non-parametric classifiers. In this chapter we proposed a 
decision boundary feature extraction algorithm for non-parametric classifiers. By 
directly utilizing the decision boundary defined by an employed non-parametric 
classifier without any assumption about the distribution of data, the proposed 
feature selection algorithm can take advantage of the generality of the non- 
parametric classifier, which can define a complex decision boundary. The 
experiments show that the performance of the proposed algorithm is very 
promising. The importance of such algorithms is enhanced as the use of non- 
parametric classifiers such as neural networks continues to grow (Lee and 
Landgrebe 1992-2, Lee and Landgrebe 1992-3). 

Compared with the conventional feature extraction/selection algorithms, 
the proposed algorithm predicts the minimum number of features to achieve the 
same classification accuracy as in the original space and at the same time finds 
the needed feature vectors which have a direct relationship with classification 
accuracy. Unlike some of the conventional extraction algorithms using the 
lumped covariance, the proposed algorithm takes full advantage of the 
information contained in class covariance differences by extracting new 
features directly from the decision boundary. Since the information contained in 
the second order statistics increases its importance in discriminating between 
classes in high dimensional data, the proposed algorithm also has potential for 
feature extraction for high dimensional data and multi-source data. 
CHAPTER 5 DECISION BOUNDARY FEATURE EXTRACTION FOR NEURAL NETWORKS 


5.1 Introduction 

Although neural networks have been successfully applied in various 
fields [(Ersoy and Hong 1990), (McEliece et al. 1987), and (Fukushima and Wake 
1991)], relatively few feature extraction algorithms are available for neural 
networks. A characteristic of neural networks is that they need a long training 
time but a relatively short classification time for test data. However, with more 
high dimensional data and multi-source data available, the resulting neural 
network can be very complex. Although, once the networks are trained, the 
computational cost of neural networks is much smaller than that of other 
non-parametric classifiers such as the Parzen density estimator (Parzen 1962) 
and the kNN classifier (Cover and Hart 1967), the lack of efficient feature 
extraction methods will inevitably introduce some inefficient calculation into 
neural networks. For example, the number of multiplications needed to classify a 
test sample using a 2 layer feedforward neural network which has 20 input 
neurons, 60 hidden neurons (assuming that the number of hidden neurons is three 
times the number of input neurons), and 3 output neurons is given by 

20*60 + 60*3 = 1,380 

Figure 5.1 illustrates an example of the hardware implementation of the original 
data set. 

Figure 5.1 Hardware implementation of the original data set (20 input 
neurons, 60 hidden neurons, 3 output neurons). 

Assuming that it is possible to obtain about the same performance with 5 
features selected by a good feature extraction method, the number of 
multiplications needed to classify a test sample can be reduced to 

20*5 + 5*15 + 15*3 = 220 

Figure 5.2 illustrates an example of the hardware implementation of the reduced 
data set. The first 100 (=20*5) multiplications are needed to calculate the 5 
features from the original 20 dimensional data. In this example, the reduction 
ratio is 220/1380 ≈ 0.16. The reduction will become larger as the number of 
hidden layers and the number of hidden neurons increase. Thus, by employing 
a good feature extraction method, the resulting network can be much faster and 
simpler. If the neural network is to be implemented in a serial computer, 
classification time can be substantially reduced. If the neural network is to be 
implemented in hardware, the complexity of the hardware can be substantially 
reduced since the complexity of the hardware is proportional to the number of 
neurons and multiplications (connections between neurons). Hardware 
implementation of neural networks is an important topic [(Moopenn et al. 1987), 
(Yasunaga et al. 1991), and (Fisher et al. 1991)]. In order to integrate a neural 
network on a single chip, it is important to reduce the number of neurons. The 
proposed method can be used in such a case, reducing the complexity of the 
network. 

Figure 5.2 Hardware implementation of the reduced data set (20 input 
neurons, 5 new input neurons, 15 hidden neurons, 3 output neurons). 
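
The multiplication counts quoted above follow from multiplying the sizes of each pair of adjacent layers. A short Python check, with the hypothetical helper name `multiplications` and with bias terms ignored as in the text:

```python
def multiplications(layer_sizes):
    """Number of multiplications for one forward pass through fully connected layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


original = multiplications([20, 60, 3])     # 20*60 + 60*3
reduced = multiplications([20, 5, 15, 3])   # 20*5 + 5*15 + 15*3

print(original, reduced, round(reduced / original, 2))  # 1380 220 0.16
```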

Neural networks are distribution free and can define arbitrary decision 
boundaries, and it is desirable that a feature extraction method for neural 
networks preserve that characteristic. In this chapter, we apply the decision 
boundary feature extraction method to neural networks. First, we propose a 
feature extraction method for neural networks using the Parzen density 
estimator. In that method, we first select features using the Parzen density 
estimator employing the decision boundary feature extraction method. Then we 
use the selected features to train a neural network. Using a reduced feature set, 
we attempt to reduce the training time of a neural network and obtain a simpler 
neural network, further reducing the classification time for test data. 

Finally, we directly apply the decision boundary feature extraction 
algorithm to neural networks (Lee and Landgrebe 1992-3). By directly applying 
the decision boundary feature extraction algorithm to neural networks, there will 
be no saving in training time. However, we will obtain a simpler network with 
better performance. 

5.2 Neural Networks 


5.2.1 Network Configurations 

We will briefly discuss the neuron and the structure of a 2-layer 
feedforward neural network which will be used in the experiments. 
Backpropagation is used to train the network. Figure 5.3 shows an example of 
the neuron (Wasserman 1989). 


Figure 5.3 Artificial neuron with activation function (inputs x1, x2, ..., xn 
with weights w1, w2, ..., wn). 


Each input is multiplied by a weight, and the products are summed. 
Next an activation function F is applied to the summation, producing the signal 
OUT as follows: 

$$\mathrm{OUT} = F(\mathrm{NET}) = \frac{1}{1 + e^{-\mathrm{NET}}}, \quad \text{where } \mathrm{NET} = \sum_{i=1}^{n} x_i w_i \qquad (5.1)$$


In the above example, the sigmoid function is used for the activation function. 
Figure 5.4 shows a 2 layer neural network (input layer, hidden layer, and output 
layer) with 2 outputs (OUT1 and OUT2). In Figure 5.4, let X be the input vector (N 
by 1) and let Y be the output vector (M by 1) of the hidden layer. Then 

$$Y = F(W_i X) \qquad (5.2)$$

where X and Y are column vectors and W_i is a weight matrix (M by N) for the 
input vector. 

Figure 5.4 An example of 2 layer feedforward neural networks (2 pattern 
classes). 

Then OUT1 and OUT2 can be expressed as follows: 

$$\mathrm{OUT}_1(X) = F(W_{o1}^T Y) = F(W_{o1}^T F(W_i X)) \qquad (5.3)$$

$$\mathrm{OUT}_2(X) = F(W_{o2}^T Y) = F(W_{o2}^T F(W_i X)) \qquad (5.4)$$

where W_o1 and W_o2 are weight vectors (M by 1) for the output vector of the hidden 
neurons. The decision rule is to select the class corresponding to the output 
neuron with the largest output (Lippmann 1987). 
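
Equations (5.1) through (5.4) and the decision rule can be sketched in a few lines of Python. The weights below are arbitrary illustrative values, not trained ones, bias terms are omitted as in the equations, and the names `forward` and `classify` are hypothetical.

```python
import numpy as np


def sigmoid(net):
    """Activation function F of equation (5.1)."""
    return 1.0 / (1.0 + np.exp(-net))


def forward(x, w_in, w_out):
    """Outputs of a 2 layer feedforward network.

    x:     input vector, shape (N,)
    w_in:  hidden layer weight matrix, shape (M, N), as in equation (5.2)
    w_out: output weights, shape (K, M); row k holds the weight vector of output k
    """
    y = sigmoid(w_in @ x)        # hidden layer output Y
    return sigmoid(w_out @ y)    # OUT_1(X), ..., OUT_K(X)


def classify(x, w_in, w_out):
    """Select the class whose output neuron has the largest output."""
    return int(np.argmax(forward(x, w_in, w_out)))


# Illustrative (untrained) weights for N = 2 inputs, M = 2 hidden neurons, 2 outputs.
w_in = np.eye(2)
w_out = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(classify(np.array([2.0, 0.0]), w_in, w_out))  # 0
print(classify(np.array([0.0, 2.0]), w_in, w_out))  # 1
```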


5.2.2 Backpropagation 

The backpropagation algorithm is used to train the neural network 
[(Wasserman 1989) and (Hertz et al. 1991)] in the experiments. In the training 
phase, the weight changes are made by 

$$\Delta W_{pq,k} = \eta\, \delta_{q,k}\, \mathrm{OUT}_{p,j}$$

where 

η = learning rate 
δ_{q,k} = the value of δ for neuron q in the layer k 
OUT_{p,j} = the value of OUT for neuron p in the layer j. 

The hidden layers are trained by propagating the output error back through the 
network layer by layer, adjusting weights at each layer. 

However, it is noted that the decision boundary feature extraction 
algorithm can be used for neural networks regardless of training algorithms. 
Any other training algorithm can be employed. 
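
The weight change rule above can be written as an outer product over all neuron pairs (p, q). The sketch below is illustrative only. The source does not spell out how δ is computed, so `output_delta` uses the standard expression for a sigmoid output neuron with squared error, which is an assumption; both function names are hypothetical.

```python
import numpy as np


def output_delta(out, target):
    """Delta of a sigmoid output neuron under squared error (assumed standard form)."""
    out = np.asarray(out, dtype=float)
    target = np.asarray(target, dtype=float)
    return out * (1.0 - out) * (target - out)


def weight_change(eta, delta_k, out_j):
    """Matrix of weight changes; entry [p, q] equals eta * delta_k[q] * out_j[p]."""
    return eta * np.outer(out_j, delta_k)


# Two neurons in layer j feeding three neurons in layer k.
dW = weight_change(0.1, delta_k=np.array([0.5, -1.0, 0.0]), out_j=np.array([1.0, 2.0]))
print(dW.shape)  # (2, 3)
```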


5.3 Feature Extraction for Neural Networks Using the Parzen Density Estimator 
5.3.1 Neural Networks and the Parzen Density Estimator 

An advantage of non-parametric classifiers is that they can define 
arbitrary decision boundaries without any assumption on underlying densities. If 
underlying densities are unknown or problems involve complex densities which 
cannot be approximated by common parametric density functions, use of a 
non-parametric classifier may be necessary. Some of the most widely used 
non-parametric classifiers include the Parzen density estimator, the kNN 
classifier, and neural networks. Recently, neural network classifiers have been 
applied to various fields and demonstrated to be attractive alternatives to 
conventional classifiers (Benediktsson et al. 1990). One of the characteristics of 
neural networks is a long training time. However, once networks are trained, 
classification for test data can be done relatively fast. 

In this section, we propose a feature extraction method for neural 
networks using the Parzen density estimator. We first select a new feature set 
using the decision boundary feature extraction algorithm for non-parametric 
classification in Chapter 4. By using the Parzen density estimator for feature 
extraction, we attempt to preserve the non-parametric characteristics of neural 
networks. Then the selected features are used to train neural networks. Using a 
reduced feature set, we attempt to reduce the training time of neural networks 
and obtain a simpler neural network, further reducing the classification time for 
test data. Figure 5.5 shows an illustration of the proposed method. 

[Feature extraction using Parzen density estimator] -> [Train neural network 
using the extracted features] 

Figure 5.5 Feature extraction for neural networks using 
Parzen density estimation. 


5.3.2 Experiments 

5.3.2.1 Experiments with generated data 

In order to evaluate closely how the proposed algorithm performs under 
various circumstances, tests are conducted on generated data with given 
statistics. 

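Example 5.1 In this example, class ω1 is normal with the following statistics: 

$$M_1 = [\text{illegible in source}], \quad \Sigma_1 = \begin{bmatrix} 3 & 0.5 \\ 0.5 & 3 \end{bmatrix}$$

Class ω2 is equally divided between two normal distributions with the following 
statistics: 

$$M_2^1 = \begin{bmatrix} -3 \\ 3 \end{bmatrix}, \quad \Sigma_2^1 = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 2 \end{bmatrix}$$

and 

$$M_2^2 = [\text{illegible in source}], \quad \Sigma_2^2 = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 2 \end{bmatrix}$$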

400 samples are generated for each class. Figure 5.6 shows the distribution of 
the data along with the decision boundary found numerically by the proposed 
procedure. Eigenvalues λ_i and eigenvectors φ_i of Σ_EDBFM are calculated as 
follows: 

$$\lambda_1 = 0.98338, \quad \lambda_2 = 0.01662, \quad \phi_1 = \begin{bmatrix} 0.69 \\ -0.72 \end{bmatrix}, \quad \phi_2 = \begin{bmatrix} 0.72 \\ 0.69 \end{bmatrix}$$

Since one eigenvalue is significantly larger than the other, it can be said that 
the rank of Σ_EDBFM is 1. That means only one feature is needed to achieve the 
same classification accuracy as in the original space. Considering the statistics 
of the two classes, the rank of Σ_EDBFM gives the correct number of features 
needed to achieve the same classification accuracy as in the original space. 
The selected features are used to train neural networks. 

Figure 5.6 Data distribution of Example 5.1. The decision boundary found 
by the proposed procedure is also shown. 


Table 5.1 shows the classification accuracies of the Parzen density estimator 
and neural networks. With one feature, the Parzen density estimator achieves 
about the same classification accuracy as could be obtained in the original 
2-dimensional space. Likewise, the neural network achieves about the same 
classification accuracy with one feature selected by the proposed algorithm. 

Table 5.1 Classification accuracies of the Parzen density estimator 
and neural networks. 

Number of Features   Parzen Density Estimator   Neural Networks
1                    91.4%                      91.6%
2                    91.6%                      90.9%

Figure 5.7 shows a graph of the classification accuracies vs. the number of 
iterations. When one feature is used, the network converged after about 40 
iterations. When two features are used, the network converged after about 70 
iterations. 

Figure 5.7 Classification accuracies vs. the number of iterations. 

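Example 5.2 In this example, there are 3 classes. Class ω1 is normal with the 
following statistics: 

$$M_1 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 9 \end{bmatrix}$$

Class ω2 is equally divided between two normal distributions with the following 
statistics: 

$$M_2^1 = [\text{illegible in source}], \quad \Sigma_2^1 = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 9 \end{bmatrix} \quad \text{and} \quad M_2^2 = [\text{illegible in source}], \quad \Sigma_2^2 = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 9 \end{bmatrix}$$

And class ω3 is equally divided between two normal distributions with the 
following statistics: 

$$M_3^1 = [\text{illegible in source}], \quad \Sigma_3^1 = \begin{bmatrix} 9 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 9 \end{bmatrix} \quad \text{and} \quad M_3^2 = [\text{illegible in source}], \quad \Sigma_3^2 = \begin{bmatrix} 9 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 9 \end{bmatrix}$$

The distributions of these classes are shown as ellipses of concentration in 
Figure 5.8 in the x1-x2 plane. 

Figure 5.8 The distributions of Example 5.2 shown as ellipses 
of concentration. 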

Table 5.2 shows the classification accuracies of the Parzen density estimator 
and neural networks. With two features, the Parzen density estimator achieves 
about the same classification accuracy as could be obtained in the original 
3-dimensional space. Table 5.2 also shows the classification accuracies of the 
neural network. The proposed feature extraction method for neural networks 
using the Parzen density estimator finds the correct 2 features, achieving about 
the same classification accuracy as could be obtained using the original 
3-dimensional data. 

Table 5.2 Classification accuracies of the Parzen density estimator 
and neural network. 

Number of Features   Parzen Density Estimator   Neural Networks
1                    65.0%                      64.8%
2                    84.8%                      84.3%
3                    84.8%                      84.0%

Figure 5.9 Classification accuracies vs. the number of iterations 
(number of features = 1, 2, and 3). 

Figure 5.9 shows a graph of the classification accuracies vs. the number of 
iterations. When one feature is used, the network essentially converged after 
about 50 iterations. When two features are used, the classification accuracies 
are almost saturated after about 75 iterations. After 150 iterations, the 
classification accuracy is about 84%. When three features are used, the network 
converged after about 75 iterations, achieving about 84% classification 
accuracy. 

5.3.2.2 Experiments with real data 

Experiments were done using FSS (Field Spectrometer System) data 
which has 60 spectral bands (Biehl et al. 1982). To evaluate the performance of 
the proposed method, two other feature selection/extraction algorithms, Uniform 
Feature Design (see Section 3.6.2.1) and Principal Component Analysis (the 
Karhunen-Loeve transformation), are tested for comparison. 

In order to test the performance in a multimodal situation, 3 classes with 2 
subclasses were chosen. In other words, 2 subclasses were combined to form a 
new class; thus the data are purposely made multimodal. Table 5.3 provides 
information on the classes. In the experiment, 500 randomly selected samples 
from each class were used as training data and the rest were used as test data. 


Table 5.3 Class description. 

Class      Subclass                       No. of Samples   Total No. of Samples
Class ω1   Spring Wheat, July 9, 1978     454              972
           Spring Wheat, July 26, 1978    518
Class ω2   Winter Wheat, June 26, 1977    677              1339
           Winter Wheat, Oct. 18, 1977    662
Class ω3   Spring Wheat, Oct. 26, 1978    441              910
           Spring Wheat, Sep. 21, 1978    469


First the original data are reduced to a 17 feature data set using Uniform 
Feature Design. Then, the Parzen density estimator is applied to the reduced 
data set and the decision boundary is calculated numerically. From the decision 
boundary, a decision boundary feature matrix is estimated and a new feature 
set is calculated from the decision boundary feature matrix. 

Using the features selected by the Parzen density estimator, neural networks 
are trained. In order to evaluate the performance of the proposed algorithm, two 
other feature sets selected by Uniform Feature Design and Principal 
Component Analysis are also used to train the network. The classification 
accuracies of training data and test data of the 1 through 10 dimensional data 
sets selected by the proposed algorithm are shown in Figures 5.10-11. As can 
be seen, the neural network using the proposed algorithm performs considerably 
better than the neural networks using Uniform Feature Design and Principal 
Component Analysis. In Figure 5.10, the 4-5 features selected by the proposed 
algorithm achieve about the same classification accuracy as can be obtained 
with the original 17 dimensional data. 

Figures 5.12-13 show graphs of classification accuracy vs. number of 
iterations. From Figure 5.13, it can be said that the performance of the neural 
networks is saturated after about 100-200 iterations. The training time is 
proportional to the number of iterations and the square of the number of neurons. 
When the network is implemented in hardware, the complexity of the hardware 
will be proportional to the square of the number of neurons. As a result, by using 
the Parzen density estimator to select features for neural networks, one can 
reduce the training time and the complexity of the hardware implementation. For 
example, in Figure 5.13 (test data), the classification accuracy with 10 features 
is 96.0% and the classification accuracy with 4 features is 93.3%. The difference 
is 2.7%. If such a decrease in classification accuracy is acceptable, the training 
time can be reduced by a factor of 6.25. Furthermore, the classification time 
will also be reduced by the same factor when implemented in a serial computer. 
When implemented in hardware, the complexity of the hardware can also be 
reduced by the same factor. 
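
The factor of 6.25 follows from the quadratic dependence on the number of neurons. The sketch below makes the underlying assumptions explicit: the number of neurons is taken to scale in proportion to the number of input features and the number of iterations is taken to be unchanged. Neither assumption is stated in this form in the text, and the name `relative_cost` is hypothetical.

```python
def relative_cost(n_features_full, n_features_reduced):
    """Cost ratio when cost is proportional to the square of the number of neurons.

    Assumes the neuron count is proportional to the number of input features
    and that the number of training iterations stays the same.
    """
    return (n_features_full / n_features_reduced) ** 2


print(relative_cost(10, 4))  # 6.25
```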
Figure 5.10 Performance comparison (training data). 

Figure 5.11 Performance comparison (test data). 

Figure 5.12 Iteration vs. classification accuracy (training data). 

Figure 5.13 Iteration vs. classification accuracy (test data). 


It is found that using the Parzen density estimator to select features for 
neural networks is not optimal in the sense that the performance can be improved 
if the decision boundary feature extraction method is directly applied to the 
neural network. In the following section, the decision boundary feature 
extraction method will be directly applied to the neural network. However, by 
directly applying the decision boundary feature extraction method to the neural 
network, there will be no saving in training time. On the other hand, using the 
Parzen density estimator to select features for neural networks results in a 
reduction in training time. 

5.4 Decision Boundary Feature Extraction for Neural Networks 

5.4.1 Decision Boundaries in Neural Networks 

In order to utilize the decision boundary feature extraction algorithm for 
neural networks, the decision boundary must be defined. We define the 
decision boundary in multi-layer feedforward neural networks as follows: 


Definition 5.1 The decision boundary in a neural network for a two pattern class 
problem is defined as 

$$\{\, X \mid \mathrm{OUT}_1(X) = \mathrm{OUT}_2(X) \,\} \qquad (5.5)$$

where X is an input vector {See equations (5.3), (5.4), and (5.5)}. 

In other words, the decision boundary of a two pattern class problem is 
defined as a locus of points on which OUT1(X) = OUT2(X) where X is an input 
vector. Let h(X) = OUT1(X) - OUT2(X) where X is an input vector to a neural 
network. Then the decision boundary can be defined as 

$$\{\, X \mid h(X) = 0 \,\} \qquad (5.6)$$

The normal vector to the decision boundary at X will be given by 

$$\nabla h(X) = \frac{\partial h}{\partial x_1}\mathbf{x}_1 + \frac{\partial h}{\partial x_2}\mathbf{x}_2 + \cdots + \frac{\partial h}{\partial x_n}\mathbf{x}_n \qquad (5.7)$$

Since the decision boundary in neural networks cannot be expressed 
analytically, the term ∇h(X) must be calculated numerically as follows: 

$$\nabla h(X) \approx \frac{\Delta h}{\Delta x_1}\mathbf{x}_1 + \frac{\Delta h}{\Delta x_2}\mathbf{x}_2 + \cdots + \frac{\Delta h}{\Delta x_n}\mathbf{x}_n \qquad (5.8)$$
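
A numerical estimate of the normal vector in the spirit of equation (5.8) can be sketched as follows. The text only specifies difference quotients Δh/Δx_i, so the use of central differences and the step size are assumptions, and the function name `numerical_normal` is hypothetical. Here `h` would be OUT1(X) - OUT2(X) for a trained network.

```python
import numpy as np


def numerical_normal(h, x, step=1e-4):
    """Unit normal vector to the surface h(X) = const at the point x.

    Each partial derivative is approximated by a central difference
    (an assumed choice; the source only writes delta h / delta x_i).
    """
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = step
        grad[i] = (h(x + e) - h(x - e)) / (2.0 * step)
    norm = np.linalg.norm(grad)
    if norm == 0:
        raise ValueError("gradient vanishes; the normal vector is undefined here")
    return grad / norm


# For a linear h the normal is the same everywhere.
normal = numerical_normal(lambda x: 3.0 * x[0] + 4.0 * x[1], [1.0, -2.0])
```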


5.4.2 Decision Boundary Feature Extraction Procedure for Neural Networks 


Next we propose the following procedure for neural networks utilizing the 
decision boundary feature extraction algorithm. 

Decision Boundary Feature Extraction Procedure for Neural Networks 
(2 pattern class case) 

STEP 1: Train the neural network using all features. 

STEP 2: For each sample correctly classified as class $\omega_1$, find the nearest 
        sample correctly classified as class $\omega_2$. Repeat the same procedure 
        for the samples classified as class $\omega_2$. 

STEP 3: The line connecting a pair of samples found in STEP 2 must pass 
        through the decision boundary since the two samples are 
        classified differently. By moving along the line, find the point on the 
        decision boundary, or near the decision boundary within a threshold. 

STEP 4: At each point found in STEP 3, estimate the normal vector $N_i$ by 

        \[ N_i = \frac{\nabla h(X)}{|\nabla h(X)|} \]

        where 

        \[ \nabla h(X) \approx \frac{\Delta h}{\Delta x_1}\mathbf{x}_1 + \frac{\Delta h}{\Delta x_2}\mathbf{x}_2 + \cdots + \frac{\Delta h}{\Delta x_n}\mathbf{x}_n \]

        and $h(X) = \mathrm{OUT}_1(X) - \mathrm{OUT}_2(X)$ {See equation (5.5)}. 

STEP 5: Estimate the decision boundary feature matrix using the normal 
        vectors found in STEP 4: 

        \[ \Sigma_{EDBFM} = \frac{1}{L} \sum_i N_i N_i^t \]

        where L is the number of samples correctly classified. 

STEP 6: Select the eigenvectors of the decision boundary feature matrix as 
        new feature vectors according to the magnitude of the corresponding 
        eigenvalues. 
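STEPs 2 through 5 can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: the trained network is represented by an arbitrary callable `h`, with `h(x) > 0` read as "classified as class 1" (an assumed convention), the boundary point is located by bisection (one possible way of "moving along the line"), and all function names are hypothetical.

```python
import math


def gradient(h, x, delta=1e-4):
    """Central-difference estimate of the gradient of h at x."""
    g = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += delta
        lo[i] -= delta
        g.append((h(hi) - h(lo)) / (2.0 * delta))
    return g


def boundary_point(h, a, b, tol=1e-6, max_iter=100):
    """STEP 3: bisect the segment a-b until |h| < tol (h(a), h(b) differ in sign)."""
    ha = h(a)
    mid = a
    for _ in range(max_iter):
        mid = [(p + q) / 2.0 for p, q in zip(a, b)]
        hm = h(mid)
        if abs(hm) < tol:
            break
        if (hm > 0) == (ha > 0):
            a, ha = mid, hm
        else:
            b = mid
    return mid


def estimate_edbfm(h, samples, labels):
    """STEPs 2-5 for two classes; labels are 1 or 2."""
    correct = {1: [], 2: []}
    for x, y in zip(samples, labels):
        predicted = 1 if h(x) > 0 else 2
        if predicted == y:
            correct[y].append(x)
    n = len(samples[0])
    matrix = [[0.0] * n for _ in range(n)]
    count = 0
    for own, other in ((1, 2), (2, 1)):
        if not correct[other]:
            continue
        for x in correct[own]:
            # STEP 2: nearest correctly classified sample of the other class.
            nearest = min(correct[other], key=lambda z: math.dist(x, z))
            point = boundary_point(h, x, nearest)
            g = gradient(h, point)
            norm = math.sqrt(sum(c * c for c in g))
            if norm == 0.0:
                continue
            unit = [c / norm for c in g]  # STEP 4: unit normal vector
            for r in range(n):
                for c in range(n):
                    matrix[r][c] += unit[r] * unit[c]
            count += 1
    if count == 0:
        raise ValueError("no boundary points were found")
    return [[v / count for v in row] for row in matrix]  # STEP 5
```

STEP 6 would then pass the returned matrix to any eigensolver for real symmetric matrices and order the eigenvectors by eigenvalue.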

If there are more than 2 classes, the procedure can be repeated for each 
pair of classes after the network is trained for all classes. Then the total decision 
boundary feature matrix can be calculated by averaging the decision boundary 
feature matrix of each pair of classes. If prior probabilities are available, the 
summation can be weighted. That is, if there are M classes, the total decision 
boundary feature matrix can be calculated as 

\[ \Sigma_{DBFM} = \sum_{i}^{M} \sum_{j,\, j \neq i}^{M} P(\omega_i)\, P(\omega_j)\, \Sigma_{DBFM}^{ij} \]

where $\Sigma_{DBFM}^{ij}$ is the decision boundary feature matrix between class $\omega_i$ and 
class $\omega_j$, and $P(\omega_i)$ is the prior probability of class $\omega_i$ if available. Otherwise let 
$P(\omega_i) = 1/M$. 
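The weighted sum over class pairs can be written out directly. The sketch below is illustrative only; the function name and the dictionary layout (one matrix per unordered pair, keyed by `(i, j)` with `i < j`) are assumptions made for this example.

```python
def total_feature_matrix(pair_matrices, priors):
    """Sum P(w_i) P(w_j) Sigma^{ij} over all ordered pairs with i != j.

    pair_matrices[(i, j)], i < j, holds the feature matrix of classes i and j;
    the same matrix is used for the pair (j, i).
    """
    num_classes = len(priors)
    n = len(next(iter(pair_matrices.values())))
    total = [[0.0] * n for _ in range(n)]
    for i in range(num_classes):
        for j in range(num_classes):
            if i == j:
                continue
            pair = pair_matrices[(min(i, j), max(i, j))]
            weight = priors[i] * priors[j]
            for r in range(n):
                for c in range(n):
                    total[r][c] += weight * pair[r][c]
    return total


# Three classes without known priors: use P = 1/M for each class.
M = 3
priors = [1.0 / M] * M
pairs = {
    (0, 1): [[1.0, 0.0], [0.0, 0.0]],
    (0, 2): [[0.0, 0.0], [0.0, 1.0]],
    (1, 2): [[1.0, 0.0], [0.0, 0.0]],
}
total = total_feature_matrix(pairs, priors)
```

Because only the eigenvectors and the relative size of the eigenvalues are used afterwards, the overall scale of the weights does not affect which features are selected.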


5.4.3 Experiments 


5.4.3.1 Experiments with generated data 

In order to evaluate closely how the proposed algorithm performs under 
various circumstances, tests are conducted on generated data with given 
statistics. 


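Example 5.3. In this example, class $\omega_1$ is normal with the following statistics: 

\[ M_1 = \begin{bmatrix} \phantom{0} \\ \phantom{0} \end{bmatrix}, \qquad \Sigma_1 = \begin{bmatrix} 3 & 0.5 \\ 0.5 & 3 \end{bmatrix} \]

And class $\omega_2$ is equally divided between two normal distributions with the 
following statistics: 

\[ M_2^1 = \begin{bmatrix} \phantom{0} \\ \phantom{0} \end{bmatrix}, \qquad \Sigma_2^1 = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 2 \end{bmatrix} \]

and 

\[ M_2^2 = \begin{bmatrix} \phantom{0} \\ -3 \end{bmatrix}, \qquad \Sigma_2^2 = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 2 \end{bmatrix} \]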

200 samples are generated for each class. Figure 5.14 shows the distribution of 
the data along with the decision boundary found numerically by the proposed 
procedure. The eigenvalues $\lambda_i$ and eigenvectors $\phi_i$ of $\Sigma_{EDBFM}$ are calculated as 
follows: 

\[ \lambda_1 = 0.98105, \quad \lambda_2 = 0.01895, \qquad \phi_1 = \begin{bmatrix} 0.72 \\ -0.69 \end{bmatrix}, \quad \phi_2 = \begin{bmatrix} 0.69 \\ 0.72 \end{bmatrix} \]

Figure 5.14 Data distribution and the decision boundary found by the 
proposed procedure. 

Since one eigenvalue is significantly larger than the other, it can be said that 
the rank of $\Sigma_{EDBFM}$ is effectively 1. That means only one feature is needed to achieve the 
same classification accuracy as in the original space. Considering the statistics 
of the two classes, the rank of $\Sigma_{EDBFM}$ gives the correct number of features 
needed to achieve the same classification accuracy as in the original space. 
With the original 2 features, the classification accuracy is about 90.8%. Table 
5.4 shows the classification accuracies of the decision boundary feature 
extraction method. As can be seen, the decision boundary feature extraction 
method finds the right feature, achieving about the same classification accuracy 
with one feature. Figure 5.15 shows classification accuracies vs. number of 
iterations. 

Table 5.4 Classification accuracies of Decision Boundary Feature Extraction 
of Example 5.3. 

Number of Features    Classification Accuracy (%) 
1                     91.6 
2                     90.9 

Figure 5.15 Iteration vs. classification accuracy (curves for 1 and 2 features). 


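Example 5.4 In this example, there are 3 classes. Class $\omega_1$ is normal with the 
following statistics: 

\[ M_1 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \qquad \Sigma_1 = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 9 \end{bmatrix} \]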


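And class $\omega_2$ is equally divided between two normal distributions with the 
following statistics: 

\[ M_2^1 = \begin{bmatrix} -5 \\ 0 \\ 0 \end{bmatrix}, \qquad \Sigma_2^1 = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 9 \end{bmatrix} \]

and 

\[ M_2^2 = \begin{bmatrix} \phantom{0} \\ \phantom{0} \\ \phantom{0} \end{bmatrix}, \qquad \Sigma_2^2 = \begin{bmatrix} \phantom{0} & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 9 \end{bmatrix} \]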


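And class $\omega_3$ is equally divided between two normal distributions with the 
following statistics: 

\[ M_3^1 = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \qquad \Sigma_3^1 = \begin{bmatrix} 9 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 9 \end{bmatrix} \]

and 

\[ M_3^2 = \begin{bmatrix} \phantom{0} \\ \phantom{0} \\ \phantom{0} \end{bmatrix}, \qquad \Sigma_3^2 = \begin{bmatrix} \phantom{0} & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 9 \end{bmatrix} \]

The distributions of these classes are shown in Figure 5.16 in the x1-x2 plane, 
along with the decision boundary found by the procedure. The eigenvalues $\lambda_i$ and 
eigenvectors $\phi_i$ of $\Sigma_{EDBFM}$ are calculated as follows: 

\[ \lambda_1 = 0.59957, \quad \lambda_2 = 0.40027, \quad \lambda_3 = 0.00015 \]

\[ \phi_1 = \begin{bmatrix} 0.06 \\ 1.00 \\ -0.02 \end{bmatrix}, \quad \phi_2 = \begin{bmatrix} -1.00 \\ 0.06 \\ 0.02 \end{bmatrix}, \quad \phi_3 = \begin{bmatrix} 0.02 \\ 0.02 \\ 1.00 \end{bmatrix} \]

\[ \mathrm{Rank}(\Sigma_{EDBFM}) = 2 \]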



Figure 5.16 Data distribution and the decision boundary found by the 
proposed procedure (classes 1, 2, and 3 plotted against Feature 1, with the 
points of the decision boundary found by the procedure). 

With the original 3 features, the classification accuracy is about 85.7%. Table 
5.5 shows the classification accuracies of the decision boundary feature 
extraction method. As can be seen, the decision boundary feature extraction 
method finds the right two features. Figure 5.17 shows a graph of the 
classification accuracies vs. the number of iterations. 

Table 5.5 Classification accuracies of Decision Boundary Feature Extraction of 
Example 5.4. 

No. Features    Classification Accuracy (%) 
1               62.8 
2               85.7 
3               85.8 

Figure 5.17 Iteration vs. classification accuracy (curves for 1, 2, and 3 features). 

5.4.3.2 Experiments with real data 

Experiments were done using FSS (Field Spectrometer System) data, 
which has 60 spectral bands (Biehl et al. 1982). Along with the proposed 
algorithm, three other feature extraction algorithms, Uniform Feature Design 
(see Section 3.6.2.1), the Karhunen-Loeve transformation (Principal 
Component Analysis), and Discriminant Analysis (Fukunaga 1990), are tested to 
evaluate the performance of the proposed algorithm. 

In the following test, 4 classes were chosen from the FSS data. Table 5.6 
provides information on the 4 classes. 400 randomly selected samples from 
each class are used as training data and the rest are used for testing. 


Table 5.6 Class description. 

SPECIES          DATE            No. of Samples 
Winter Wheat     May 3, 1977     657 
Unknown Crops    May 3, 1977     678 
Winter Wheat     Mar. 8, 1977    691 
Unknown Crops    Mar. 8, 1977    619 


First the original 60 dimensional data was reduced to 17 dimensional data 
using Uniform Feature Design. Then the decision boundary feature extraction 
method, Discriminant Analysis, and Principal Component Analysis were applied 
to the 17 dimensional data. Figures 5.18 and 5.19 show the classification 
results for training data and test data. With the 17 dimensional data, one can 
achieve about 97.6% classification accuracy for training data and about 94.4% 
classification accuracy for test data. The decision boundary feature extraction 
method achieves nearly the same classification accuracy for test data with just 3 
features, as can be seen in Figure 5.19. With 3 features, the decision boundary 
feature extraction method achieves about 92.2% classification accuracy for test 
data while Uniform Feature Design, Principal Component Analysis, and 
Discriminant Analysis achieve about 77.7%, 78.6%, and 89.7%, respectively. 

Figure 5.18 Performance comparison of the data of Table 5.6 (Train data; 
horizontal axis: Number of Features). 

Figure 5.19 Performance comparison of the data of Table 5.6 (Test data). 


In the next test, there are 3 classes and each class has 2 subclasses. In 
other words, 2 subclasses were combined to form a new class. By purposely 
combining data from different classes, the data are made to be multi-modal. 
Table 5.7 provides information on the classes. Figure 4.13 (Chapter 4) shows a 
graph of the 6 subclasses and Figure 4.14 (Chapter 4) shows a graph of the 3 
classes, each of which has 2 subclasses. 500 randomly selected samples from 
each class are used as training data and the rest are used for testing. 


Table 5.7 Class description. 

Class               Subclass                        No. of Samples    Total No. of Samples 
Class $\omega_1$    Winter Wheat, March 8, 1977     691               1209 
                    Spring Wheat, July 26, 1978     518 
Class $\omega_2$    Winter Wheat, June 26, 1977     677               1146 
                    Spring Wheat, Sep. 21, 1978     469 
Class $\omega_3$    Winter Wheat, Oct. 18, 1977     662               1103 
                    Spring Wheat, Oct. 26, 1978     441 


Figures 5.20-21 show the performance comparison. First the original 60 
dimensional data was reduced to 17 dimensional data using Uniform Feature 
Design. With the 17 features, the classification accuracies of training data and 
test data are 99.9% and 95.6%, respectively. In low dimensionality ($N \le 2$), 
Discriminant Analysis shows the best performance. However, the classification 
accuracies there are much lower than the maximum possible classification 
accuracies, so the comparison is of little practical relevance. When more than 2 features 
are used, the decision boundary feature extraction method outperforms all other 
methods. The decision boundary feature extraction method achieves about the 
same classification accuracy as could be obtained in the original 17-dimensional 
space with just 4 features. In particular, with 4 features, the 
classification accuracy of the decision boundary feature extraction method is 
about 92.4% while the classification accuracies of Uniform Feature Design, 
Principal Component Analysis, and Discriminant Analysis are 78.1%, 82.5%, 
and 82.3%, respectively. 

Figure 5.20 Performance comparison of the data of Table 5.7 (train data). 

Figure 5.21 Performance comparison of the data of Table 5.7 (test data). 

Figures 5.22-23 show graphs of classification accuracy vs. number of iterations 
of the decision boundary feature extraction method. As can be seen, the 
performances of neural networks are saturated after about 100 iterations. It can 
also be seen in Figure 5.23 that the performances are almost saturated when 4 
features are used. 



Figure 5.22 Iteration vs. classification accuracy (training data; horizontal 
axis: Number of Iterations; curves for 3, 4, 5, 6, and 17 features). 

Figure 5.23 Iteration vs. classification accuracy (test data; curves for 3, 4, 
5, 6, and 17 features). 


In the next test, there are 3 classes and each class has 2 subclasses. In 
other words, 2 subclasses were combined to form a new class. By purposely 
combining data from different classes, the data are made to be multi-modal. 
Table 5.8 provides information on the classes. 500 randomly selected samples 
from each class are used as training data and the rest are used for testing. 

Table 5.8 Class description. 

Class               Subclass                        No. of Samples    Total No. of Samples 
Class $\omega_1$    Winter Wheat, March 8, 1977     691               1145 
                    Spring Wheat, July 9, 1978      454 
Class $\omega_2$    Winter Wheat, June 26, 1977     677               1195 
                    Spring Wheat, July 26, 1978     518 
Class $\omega_3$    Winter Wheat, Oct. 18, 1977     662               1103 
                    Spring Wheat, Oct. 26, 1978     441 


Figures 5.24-25 show the performance comparison. First the original 60 
dimensional data was reduced to 17 dimensional data using Uniform Feature 
Design. Then the decision boundary feature extraction method, Discriminant 
Analysis, and Principal Component Analysis were applied to the 17 
dimensional data. With the 17 features, the classification accuracies of training 
data and test data are 97.3% and 96.7%, respectively. In low dimensionality 
($N \le 2$), Discriminant Analysis shows the best performance, though the 
difference between Discriminant Analysis and the decision boundary feature 
extraction method is small. However, when more than 2 features are used, the 
decision boundary feature extraction method outperforms all other methods. 
With 3 features, the decision boundary feature extraction method achieves 
about 95.6% classification accuracy for test data while Uniform Feature Design, 
Principal Component Analysis, and Discriminant Analysis achieve about 82.3%, 
85.1%, and 90.8%, respectively. 

Figure 5.24 Performance comparison (training data). 

Figure 5.25 Performance comparison (test data). 

5.5 Conclusion 

In this chapter, we extended the decision boundary feature extraction 
method to neural networks. First we proposed a feature extraction algorithm 
for neural networks using the Parzen density estimator (Figure 5.5). In this method, 
we first selected a new feature set using the Parzen density estimator, employing 
the non-parametric decision boundary feature extraction algorithm of Chapter 4. 
Then we used the reduced feature set to train neural networks. As a result, it 
is possible to reduce training time and to obtain a simpler network 
because fewer features are used. 

However, it is recognized that the characteristics of the Parzen density 
estimator and neural networks are not exactly the same. Thus, we applied the 
decision boundary feature extraction method directly to neural networks. We 
started by defining the decision boundary in a neural network. From the 
decision boundary, we estimated the normal vectors to the decision boundary, 
and the decision boundary feature matrix was calculated. From the decision 
boundary feature matrix, a new feature set was calculated. By directly applying 
the decision boundary feature extraction algorithm to neural networks, the 
performance was improved compared with using the Parzen density estimator 
for feature extraction. However, it is noted that by directly applying the decision 
boundary feature extraction algorithm to neural networks, there is no reduction 
in training time. In fact, the training time increased since we need to train two 
networks, one for the original feature set and the other for the reduced feature 
set. 


The proposed algorithms preserve the ability of neural networks to 
define a complex decision boundary and are able to take advantage of that 
ability. By employing the proposed algorithms, it is possible to reduce the 
number of features, and equivalently the number of neurons. This reduction 
results in much simpler networks and shorter classification time. When neural 
networks are to be implemented in hardware, the reduced number of neurons 
means a simpler architecture (Figures 5.1-2). 

CHAPTER 6 DISCRIMINANT FEATURE EXTRACTION FOR 
PARAMETRIC AND NON-PARAMETRIC CLASSIFIERS 


6.1 Introduction 

In Chapter 3, the decision boundary feature extraction algorithm was 
developed where a new feature set is extracted from the decision boundary 
such that the classification result is preserved. The decision boundary feature 
extraction was applied to parametric classifiers (Chapter 3), to non-parametric 
classifiers (Chapter 4), and to neural networks (Chapter 5). In order to extract 
feature vectors from the decision boundary, the decision boundary feature 
matrix was defined which is constructed from the normal vectors to the decision 
boundary. In the decision boundary feature extraction techniques, we do not 
care whether the value of the discriminant function is changed or not, as long as 
the classification result remains the same. 


In this chapter, the concept of the decision boundary feature extraction 
algorithm is generalized such that feature extraction is considered as 
preserving the value of the discriminant function for a given classifier (Lee and 
Landgrebe 1992-4). We consider feature extraction as eliminating features 
which have no impact on the value of the discriminant function, and propose a 
feature extraction algorithm which eliminates those irrelevant features and 
retains only useful features. The proposed algorithm, referred to as Discriminant 
Feature Extraction, can be used for both parametric and non-parametric 
classifiers, and its performance does not deteriorate when there is no difference 
in mean vectors or no difference in covariance matrices. 

Compared with the decision boundary feature extraction algorithms, 
Discriminant Feature Extraction will be less efficient for parametric classifiers 
where a good estimate of the decision boundary can be obtained. However, 
in some non-parametric classifiers, it is difficult or time-consuming to find the 
decision boundary. In such cases, Discriminant Feature Extraction could be a 
good alternative solution. Furthermore, by extracting features such that the 
value of interest is preserved, Discriminant Feature Extraction can be used not 
only for feature extraction for classification but also for feature extraction in other 
applications. A more detailed comparison will be made later. 


6.2 Definitions and Theorems 

We will briefly review Bayes' decision rule for minimum error. Let X be an 
observation in the N-dimensional Euclidean space $E^N$ under hypothesis $H_i$: $X \in \omega_i$, 
i = 1, 2. Decisions will be made according to the following rule (Fukunaga 
1990): 

\[ \text{Decide } \omega_1 \text{ if } P(\omega_1)P(X|\omega_1) > P(\omega_2)P(X|\omega_2), \text{ else } \omega_2 \tag{6.1} \]

Let $h(X) = -\ln \dfrac{P(X|\omega_1)}{P(X|\omega_2)}$ and $t = \ln \dfrac{P(\omega_1)}{P(\omega_2)}$. Then the decision rule will be 

\[ \text{Decide } \omega_1 \text{ if } h(X) < t, \text{ else } \omega_2 \tag{6.2} \]

For the purpose of the proposed feature extraction, we start with defining 
"discriminantly irrelevant feature" as follows:¹ 

¹ We distinguish "discriminantly irrelevant feature" from "discriminant redundant feature" 
(Definition 3.1) in that the discriminantly irrelevant feature does not change the value of the 
discriminant function while the discriminant redundant feature does not change the 
classification result. 

Definition 6.1 We say the vector $\beta_k$ is discriminantly irrelevant if, for any 
observation X, 

\[ h(X) = h(X + c\beta_k) \]

where c is a constant. 

Since $h(X) = h(X + c\beta_k)$, the classification result for $X + c\beta_k$ is the same as the 
classification result for X. It can be easily seen that a discriminantly irrelevant 
feature does not contribute anything to discriminating between classes. 

In a similar manner, we define "discriminantly relevant feature" as follows. 

Definition 6.2 We say the vector $\beta_k$ is discriminantly relevant if there exists at 
least one observation X such that 

\[ h(X) \neq h(X + c\beta_k) \]

where c is a constant. 
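Definitions 6.1 and 6.2 can be checked numerically on a small case. The sketch below assumes two bivariate Gaussian classes with a common covariance matrix; the mean vectors are hypothetical values chosen only for illustration, and the function names are not from the text. With equal covariances h(X) is linear in X, so any direction orthogonal to $\Sigma^{-1}(M_2 - M_1)$ leaves h unchanged.

```python
import math


def inverse_2x2(s):
    """Return the inverse and determinant of a 2x2 matrix."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    return inv, det


def log_gaussian(x, mean, cov):
    """Log density of a bivariate normal distribution at x."""
    inv, det = inverse_2x2(cov)
    d = [x[0] - mean[0], x[1] - mean[1]]
    quad = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return -0.5 * quad - 0.5 * math.log(det) - math.log(2.0 * math.pi)


def h(x, m1, c1, m2, c2):
    """h(X) = -ln[ P(X|w1) / P(X|w2) ]."""
    return -(log_gaussian(x, m1, c1) - log_gaussian(x, m2, c2))


def decide(x, m1, c1, m2, c2, p1, p2):
    """Rule (6.2): decide class 1 if h(X) < t, else class 2."""
    t = math.log(p1 / p2)
    return 1 if h(x, m1, c1, m2, c2) < t else 2


cov = [[1.0, 0.5], [0.5, 1.0]]
m1, m2 = [0.0, 0.0], [2.0, 0.0]   # hypothetical means
beta = [1.0, 2.0]                 # orthogonal to inverse(cov) @ (m2 - m1)
x = [0.3, -0.7]
shifted = [x[0] + 3.0 * beta[0], x[1] + 3.0 * beta[1]]
unchanged = abs(h(x, m1, cov, m2, cov) - h(shifted, m1, cov, m2, cov)) < 1e-9
```

In this setting `beta` is discriminantly irrelevant in the sense of Definition 6.1, while a shift along the first coordinate axis changes h and is therefore discriminantly relevant.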


From these definitions, it is clear that all discriminantly irrelevant features are 
features which have no impact on the value of the discriminant function and can 
be eliminated without increasing any classification error. Thus if it is possible to 
find all the discriminantly irrelevant features for a given classifier, it will be also 
possible to obtain the same classification accuracy as in the original space with 
a reduced number of features. To eliminate discriminantly irrelevant features for 
a given classifier, or equivalently to retain discriminantly relevant features, we 
define the discriminant feature matrix as follows: 

Definition 6.3 The discriminant feature matrix (DFM): The discriminant feature 
matrix is defined as 

\[ \Sigma_{DFM} = \int N(X)\, N^t(X)\, p(X)\, dX \tag{6.3} \]

where 

\[ N(X) = \frac{\nabla h(X)}{|\nabla h(X)|}, \]

p(X) is a probability density function, 

\[ \nabla h(X) = \frac{\partial h}{\partial x_1}\mathbf{x}_1 + \frac{\partial h}{\partial x_2}\mathbf{x}_2 + \cdots + \frac{\partial h}{\partial x_n}\mathbf{x}_n, \]

and 

\[ h(X) = -\ln \frac{P(X|\omega_1)}{P(X|\omega_2)}. \]

The properties of the discriminant feature matrix are similar to those of the 
decision boundary feature matrix. The proofs are identical to those in Section 3.5.3. 

Property 6.1 The discriminant feature matrix is a real symmetric matrix. 

Property 6.2 The eigenvectors of the discriminant feature matrix are 
orthogonal. 

Property 6.3 The discriminant feature matrix is positive semi-definite. 

Property 6.4 The discriminant feature matrix of the whole space can be 
expressed as a summation of the discriminant feature matrices 
calculated from subspaces of the whole space if the subspaces are 
mutually exclusive and exhaustive. 

Now we will show that all the eigenvectors of the discriminant feature 
matrix corresponding to zero-eigenvalues are discriminantly irrelevant and can 
be eliminated without increasing the classification error. In this regard, we state 
the following theorem. 

Theorem 6.1 The eigenvectors of the discriminant feature matrix of a 
pattern classification problem corresponding to zero-eigenvalues are 
discriminantly irrelevant and can be eliminated without increasing any 
classification error. 

Proof: We assume h(X) is continuous and differentiable for all X. Let $\Sigma_{DFM}$ be 
the discriminant feature matrix as defined in Definition 6.3. Suppose that 

\[ \mathrm{rank}(\Sigma_{DFM}) = M < N. \]

Let $\{\phi_1, \phi_2, \ldots, \phi_M\}$ be the eigenvectors of $\Sigma_{DFM}$ corresponding to non-zero 
eigenvalues. Then, for any X, $\nabla h(X)$ can be represented by a linear 
combination of $\phi_i$, i = 1, ..., M. In other words, 

\[ \nabla h(X) = \sum_{i=1}^{M} a_i \phi_i \quad \text{where } a_i \text{ is a coefficient} \]

Let $\varphi$ be an eigenvector whose corresponding eigenvalue is zero. Then $\varphi$ is 
orthogonal to any eigenvector whose eigenvalue is not zero, since 
eigenvectors of a symmetric matrix belonging to distinct eigenvalues are 
orthogonal to each other. It can be easily seen that the discriminant feature 
matrix is symmetric (Definition 6.3). 

Thus, $\varphi$ is orthogonal to $\nabla h(X)$ for any X since 

\[ \varphi^t \nabla h(X) = \varphi^t \sum_{i=1}^{M} a_i \phi_i = \sum_{i=1}^{M} a_i \varphi^t \phi_i = 0 \]

Assume that $\varphi$ is not discriminantly irrelevant, i.e. it is discriminantly relevant. Then 
there exists at least one observation Y such that 

\[ h(Y) \neq h(Y + c\varphi) \quad \text{where } c \text{ is a constant} \]

Let $h(Y) = t_0$, $h(Y + c\varphi) = t_1$, and $t_0 \neq t_1$. Then, since h is continuous, there will be a 
point Y' between Y and $Y + c\varphi$ such that 

\[ h(Y') = \frac{t_0 + t_1}{2} \]

Physically, $h(X) = (t_0 + t_1)/2$ is a surface and $\nabla h(Y')$ is a normal vector to the 
surface at Y'. Then Y and $Y + c\varphi$ must be on different sides of the surface 
$h(X) = (t_0 + t_1)/2$. This means $c\varphi$ must pass through the surface $h(X) = (t_0 + t_1)/2$ at Y'. 
This contradicts the assumption that $\varphi$ is orthogonal to $\nabla h(X)$ for any X, 
including $\nabla h(Y')$. Therefore if $\varphi$ is an eigenvector of the discriminant feature 
matrix whose corresponding eigenvalue is zero, $\varphi$ is discriminantly irrelevant 
and can be eliminated without increasing any classification error. 

Q.E.D. 

Figure 6.1 shows an illustration of the proof. It is impossible that $c\varphi$ passes 
through the surface $h(X) = (t_0 + t_1)/2$ at Y' and is orthogonal to $\nabla h(Y')$ at the same 
time. 

From Theorem 6.1, it can be easily seen that, if the rank of the 
discriminant feature matrix is M, the minimum number of features needed to 
achieve the same classification accuracy as in the original space must be 
smaller than or equal to M. In particular, if the rank of the discriminant feature 
matrix is 1, only one feature is needed to achieve the maximum classification 
accuracy. This will happen when the covariance matrices of the two classes are 
the same assuming a Gaussian ML classifier is used. However, it is noted that 
all eigenvectors whose eigenvalues are not zero are not necessarily needed to 
achieve the same classification accuracy as could be obtained in the original 
space. 


We will refer to the feature extraction algorithm based on Theorem 6.1 
as Discriminant Feature Extraction. In practice, we will choose eigenvectors of 
the discriminant feature matrix according to the magnitude of the corresponding 
eigenvalues. 


6.3 Discriminant Feature Extraction and Decision Boundary Feature Extraction 

In Chapter 3, we introduced the decision boundary feature extraction 
algorithm. It was shown that all the feature vectors needed for classification can 
be extracted from the decision boundary. The decision boundary feature 
extraction algorithm was successfully applied to parametric classifiers in 
Chapter 3, non-parametric classifiers in Chapter 4, and neural networks in 
Chapter 5. Now we will show that the discriminant feature extraction method is a 
generalized form of the decision boundary feature extraction method. 

In the spectral decomposition of a matrix A (n by n), A can be 
represented by (Cullen 1972) 

\[ A = \lambda_1 E_1 + \lambda_2 E_2 + \cdots + \lambda_n E_n \tag{6.4} \]

where the matrices $E_i$ are called the projectors of A or the principal idempotents 
of A. In Chapter 3, the decision boundary feature matrix is defined as follows 
(Definition 3.6): 

\[ \Sigma_{DBFM} = \int N(X)\, N^t(X)\, p(X)\, dX \tag{6.5} \]


There is a similarity between equations (6.4) and (6.5). In fact, the decision 
boundary feature matrix can be viewed as a matrix whose principal idempotents 
are constructed from normal vectors to the decision boundary. As a result, in the 
decision boundary feature extraction method, a new feature set is extracted so 
that the classification results are preserved. 

On the other hand, in the discriminant feature extraction method, the 
discriminant feature matrix is defined as follows (Definition 6.3): 

\[ \Sigma_{DFM} = \int N(X)\, N^t(X)\, p(X)\, dX \tag{6.3} \]

The discriminant feature matrix can be viewed as a matrix whose principal 
idempotents are constructed from vectors which give changes to the value of 
the discriminant function. As a result, in the discriminant feature extraction 
method, a new feature set is extracted such that the value of the discriminant 
function for a given classifier is preserved. Consider the example in Figure 6.1. 
In the decision boundary feature extraction, the value of h(X) in (6.1) can be 
changed as long as the classification of X remains the same. In Discriminant 
Feature Extraction, the value of h(X) is preserved. As a result, the decision 
boundary feature extraction method will be more efficient if the decision 
boundary can be exactly located, which is generally the case for parametric 
classifiers. However, when finding the decision boundary is difficult or time 
consuming, as in some cases of non-parametric classifiers, the discriminant 
feature extraction method could be an alternative. Furthermore, since the 
discriminant feature extraction method finds all the vectors which give changes 
in the value of the discriminant function, such a generalization can be used for 
feature extraction in other applications such as density estimation and 
non-parametric regression. 

Figure 6.2 Decision Boundary Feature Extraction and 
Discriminant Feature Extraction. 


6.4 Discriminant Feature Extraction 

6.4.1 Discriminant Feature Extraction for Two Pattern Classes 

Now we propose a procedure to calculate the discriminant feature matrix 
for parametric and non-parametric classifications. 

Procedure for Discriminant Feature Extraction for 
Parametric/Non-Parametric Classifications 
(2 pattern class case) 

1. Classify the training data using full dimensionality. 

2. From correctly classified samples, estimate the discriminant feature matrix as 
   follows: 

   \[ \Sigma_{DFM} = \frac{1}{L} \sum_i N_i N_i^t \]

   where L is the number of samples correctly classified, 
   $N_i = \nabla h(X_i) / |\nabla h(X_i)|$ is evaluated at the i-th correctly classified sample $X_i$, and 

   \[ \nabla h(X) = \frac{\partial h}{\partial x_1}\mathbf{x}_1 + \frac{\partial h}{\partial x_2}\mathbf{x}_2 + \cdots + \frac{\partial h}{\partial x_n}\mathbf{x}_n \]

   For non-parametric classifications, estimate $\nabla h(X)$ as follows: 

   \[ \nabla h(X) \approx \frac{\Delta h}{\Delta x_1}\mathbf{x}_1 + \frac{\Delta h}{\Delta x_2}\mathbf{x}_2 + \cdots + \frac{\Delta h}{\Delta x_n}\mathbf{x}_n \]

3. Select the eigenvectors of the discriminant feature matrix as new 
   feature vectors according to the magnitude of corresponding eigenvalues. 
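Steps 2 and 3 can be sketched as below. This is an illustration under assumptions rather than the implementation used in the experiments: the samples passed in are taken to be the correctly classified ones from step 1, the gradient is estimated by central differences, and only the dominant eigenvector is extracted, by power iteration started from an all-ones vector (which fails if that vector happens to be orthogonal to the dominant eigenvector). All function names are hypothetical.

```python
import math


def gradient(h, x, delta=1e-4):
    """Central-difference estimate of the gradient of h at x."""
    g = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += delta
        lo[i] -= delta
        g.append((h(hi) - h(lo)) / (2.0 * delta))
    return g


def discriminant_feature_matrix(h, correct_samples):
    """Step 2: average N_i N_i^t over the correctly classified samples."""
    n = len(correct_samples[0])
    matrix = [[0.0] * n for _ in range(n)]
    count = 0
    for x in correct_samples:
        g = gradient(h, x)
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:
            continue  # no direction information at this sample
        unit = [c / norm for c in g]
        for r in range(n):
            for c in range(n):
                matrix[r][c] += unit[r] * unit[c]
        count += 1
    if count == 0:
        raise ValueError("no usable samples")
    return [[v / count for v in row] for row in matrix]


def dominant_eigenpair(matrix, iterations=200):
    """Step 3 (first feature only): power iteration on a symmetric PSD matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[r][c] * v[c] for c in range(n)) for r in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the range of the matrix")
        v = [c / norm for c in w]
    value = sum(v[r] * sum(matrix[r][c] * v[c] for c in range(n)) for r in range(n))
    return value, v
```

Unlike the decision boundary procedure of Chapter 5, no boundary points are searched for here; the normal vectors are evaluated at the samples themselves, which is what makes the method usable when the boundary is hard to locate.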

6.4.2 Discriminant Feature Extraction for Multiclass Case 


If there are more than 2 classes, the procedure can be repeated for each 
pair of classes. The total discriminant feature matrix can be calculated by 
averaging the discriminant feature matrix of each pair of classes. If prior 
probabilities are available, the summation can be weighted. In other words, if 
there are M classes, the total discriminant feature matrix can be calculated as 

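\[ \Sigma_{DFM} = \sum_{i}^{M} \sum_{j,\, j \neq i}^{M} P(\omega_i)\, P(\omega_j)\, \Sigma_{DFM}^{ij} \tag{6.6} \]

where $\Sigma_{DFM}^{ij}$ is the discriminant feature matrix between class $\omega_i$ and class $\omega_j$, 
and $P(\omega_i)$ is the prior probability of class $\omega_i$ if available. Otherwise let 
$P(\omega_i) = 1/M$. 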


6.4.3 Eliminating Redundancy in Multiclass Problems 

The total discriminant feature matrix defined in equation (6.6) can be 
made more efficient. Consider the following example. Suppose Table 
6.1 shows the eigenvalues for a 2 pattern class problem. Table 6.1 also shows 
the proportions of the eigenvalues, the classification accuracies, and the normalized 
classification accuracies obtained by dividing the classification accuracies by 
the classification accuracy obtained using all features. With just one feature, the 
classification accuracy is 91.6%, which is 97.3% of the classification accuracy 
obtained using all features. Thus, in this 2 class problem, if this level of accuracy 
is deemed adequate, only one feature needs to be included in calculating 
the total discriminant feature matrix. The other 19 features contribute little to 
improving classification accuracy and can be eliminated in calculating the total 
discriminant feature matrix. In addition, feature vectors from other pairs of 
classes will improve the classification accuracy. 


Table 6.1 Eigenvalues of the discriminant feature matrix. 

Eigenvalue    Proportion of     Accumulation of    Classification    Normalized Classification 
              Eigenvalue (%)    Eigenvalue (%)     Accuracy (%)      Accuracy (%) 
0.323         6.7               6.7                91.6              97.3 
0.211         4.4               11.1               93.3              99.1 
0.149         3.1               14.2               92.7              98.5 
0.093         1.9               16.1               92.9              98.7 
0.069         1.4               17.6               91.8              97.6 
0.048         1.0               18.5               93.9              99.8 
0.039         0.8               19.4               94.1              100.0 
0.026         0.5               19.9               94.5              100.4 
0.018         0.4               20.3               94.5              100.4 
0.012         0.3               20.5               94.7              100.6 
0.006         0.1               20.7               94.3              100.2 
0.002         0.0               20.7               94.7              100.6 
0.002         0.0               20.7               94.5              100.4 
0.001         0.0               20.8               94.7              100.6 
0.001         0.0               20.8               94.5              100.4 
0.001         0.0               20.8               94.1              100.0 
0.000         0.0               20.8               94.1              100.0 
0.000         0.0               20.8               94.1              100.0 
0.000         0.0               20.8               94.1              100.0 
0.000         0.0               20.8               94.1              100.0 


To eliminate such redundancy in multiclass problems, we define the
discriminant feature matrix of $P_t$, $\Sigma_{DFM}(P_t)$, as follows:

Definition 6.4 Let $L_t$ be the number of eigenvectors corresponding to the largest
eigenvalues needed to obtain $P_t$ of the classification accuracy obtained
with all features. Then the discriminant feature matrix of $P_t$,
$\Sigma_{DFM}(P_t)$, is defined as

$$\Sigma_{DFM}(P_t) = \sum_{i=1}^{L_t} \lambda_i \phi_i \phi_i^T$$

where $\lambda_i$ and $\phi_i$ are the eigenvalues and eigenvectors of the
discriminant feature matrix.

The total discriminant feature matrix of $P_t$ of a multiclass problem can be
defined as

$$\Sigma_{DFM}(P_t) = \sum_{i=1}^{M} \sum_{j=1, j \neq i}^{M} P(\omega_i) P(\omega_j) \Sigma_{DFM}^{ij}(P_t)$$

where $\Sigma_{DFM}^{ij}(P_t)$ is the discriminant feature matrix of $P_t$ between
class $\omega_i$ and class $\omega_j$, and $P(\omega_i)$ is the prior probability of
class $\omega_i$ if available. Otherwise let $P(\omega_i) = 1/M$.

In the experiments to follow, $P_t$ is set to between 0.95 and 0.97 (see section
3.5.6).

From Definition 6.4, we can calculate the discriminant feature matrix of 0.95 of
Table 6.1 as follows:

The classification accuracy using full dimensionality (20) is 94.1%. The
number of features needed to achieve a classification accuracy of
89.4% (= 94.1 * 0.95) is 1. Therefore, the discriminant feature matrix of 0.95 of
Table 6.1 is given by

$$\Sigma_{DFM}(0.95) = \sum_{i=1}^{1} \lambda_i \phi_i \phi_i^T = \lambda_1 \phi_1 \phi_1^T$$

where the $\lambda_i$ are the eigenvalues of $\Sigma_{DFM}$ sorted in descending order
and the $\phi_i$ are the corresponding eigenvectors.
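
The truncation in Definition 6.4 and the worked example for Table 6.1 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in the experiments: the helper names `truncated_dfm` and `total_dfm` are hypothetical, the 20 x 20 input is a diagonal stand-in built from the eigenvalue column of Table 6.1 (the actual matrix is not given in the text), and the accuracy list is the classification accuracy column of the same table.

```python
import numpy as np

def truncated_dfm(dfm, accuracies, p_t=0.95):
    """Sketch of Definition 6.4 (hypothetical helper).

    dfm        : symmetric discriminant feature matrix, shape (d, d)
    accuracies : accuracies[k-1] is the accuracy obtained with the k leading
                 eigenvectors; the last entry is the accuracy with all features
    p_t        : fraction of the full-dimensional accuracy to retain
    """
    eigvals, eigvecs = np.linalg.eigh(dfm)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    target = p_t * accuracies[-1]
    # L_t: smallest number of leading eigenvectors whose accuracy reaches the target
    l_t = next(k + 1 for k, acc in enumerate(accuracies) if acc >= target)

    lead = eigvecs[:, :l_t]
    return l_t, (lead * eigvals[:l_t]) @ lead.T  # sum_i lambda_i phi_i phi_i^T


def total_dfm(pairwise, priors=None):
    """Prior-weighted sum of pairwise matrices; pairwise[(i, j)] with i != j."""
    n_classes = 1 + max(max(pair) for pair in pairwise)
    if priors is None:                           # fall back to P(w_i) = 1/M
        priors = np.full(n_classes, 1.0 / n_classes)
    return sum(priors[i] * priors[j] * mat for (i, j), mat in pairwise.items())


# Values copied from Table 6.1; the diagonal matrix is only a stand-in.
eigenvalues = [0.323, 0.211, 0.149, 0.093, 0.069, 0.048, 0.039, 0.026, 0.018,
               0.012, 0.006, 0.002, 0.002, 0.001, 0.001, 0.001, 0.0, 0.0, 0.0, 0.0]
accuracies = [91.6, 93.3, 92.7, 92.9, 91.8, 93.9, 94.1, 94.5, 94.5, 94.7,
              94.3, 94.7, 94.5, 94.7, 94.5, 94.1, 94.1, 94.1, 94.1, 94.1]
l_t, dfm_095 = truncated_dfm(np.diag(eigenvalues), accuracies, p_t=0.95)
print(l_t)  # 1, the number of retained eigenvectors stated in the text
```

With the stand-in matrix, only the leading eigenvalue 0.323 survives the truncation, because the first row of Table 6.1 already exceeds 0.95 x 94.1%.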




6.5 Experiments 

6.5.1 Parametric Classification 


6.5.1.1 Experiments with Generated Data

To evaluate closely how the proposed algorithm performs under various 
circumstances, tests are conducted on data generated with given statistics 
assuming Gaussian distributions. In all parametric examples, a Gaussian ML 
classifier is used. 


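Example 6.1 In this example, data are generated with the following statistics
(the mean vectors $M_1$ and $M_2$ are not legible in the source):

$$\Sigma_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \quad P(\omega_1) = P(\omega_2) = 0.5$$
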

200 samples are generated for each class. Since the covariance matrices are
the same, it can be easily seen that the decision boundary will be a straight line
and just one feature is needed to achieve the same classification accuracy as in
the original space. The eigenvalues $\lambda_i$ of $\Sigma_{DFM}$ are calculated as
follows (the eigenvectors $\phi_i$ are not legible in the source):

$$\lambda_1 = 0.99971, \quad \lambda_2 = 0.00029$$

Since one eigenvalue is significantly larger than the other, it can be said that
the rank of $\Sigma_{DFM}$ is 1. That means only one feature is needed to achieve the
same classification accuracy as in the original space. Table 6.2 shows the
classification accuracies. The proposed algorithm finds the right feature,
achieving the same classification accuracy as in the original space with one
feature.

Table 6.2 Classification accuracies of Example 6.1.

No. Features    Classification Accuracy (%)
1               95.8
2               95.8


Example 6.2 In this example, data are generated with the following statistics
(the mean vectors and covariance matrices are not legible in the source):

$$P(\omega_1) = P(\omega_2) = 0.5$$

200 samples are generated for each class. In this case, there is almost no
difference in the mean vectors. The variance of feature 1 of class $\omega_1$ is equal to
that of class $\omega_2$ while the variance of feature 2 of class $\omega_1$ is larger than
that of class $\omega_2$. Thus the decision boundary will consist of two hyperbolas.
However, the effective decision boundary could be approximated by a straight
line. As a result, only one feature may be needed to achieve almost the same
classification accuracy as in the original space. The eigenvalues $\lambda_i$ of
$\Sigma_{DFM}$ are calculated as follows (the eigenvectors $\phi_i$ are only partly
legible in the source; the legible entries have magnitudes of about 0.09 and
1.00):

$$\lambda_1 = 0.99330, \quad \lambda_2 = 0.00670$$

Since one eigenvalue is significantly larger than the other, it can be said that
the rank of $\Sigma_{DFM}$ is 1. That means only one feature is needed to achieve the
same classification accuracy as in the original space. Considering the statistics
of the two classes, the rank of $\Sigma_{DFM}$ gives the correct number of features to
achieve the same classification accuracy as in the original space. Table 6.3
shows the classification accuracies. The proposed algorithm finds the right
feature, achieving the same classification accuracy as in the original space with
one feature.


Table 6.3 Classification accuracies of Example 6.2.

No. Features    Classification Accuracy (%)
1               61.0
2               61.0

From the examples, it can be seen that the proposed discriminant feature
extraction algorithm finds a good feature set even though there is no class
mean difference (Example 6.2) or no class covariance difference (Example 6.1).


6.5.1.2 Experiments with Real Data

Along with the proposed Discriminant Feature Extraction, five other 
feature selection/extraction algorithms, Uniform Feature Design, Principal 
Component Analysis (the Karhunen-Loeve transformation) (Richards 1986), 
Canonical Analysis (Richards 1986), the Foley & Sammon method (Foley and 
Sammon 1975), and Decision Boundary Feature Extraction are tested to 
evaluate and compare the performance of Discriminant Feature Extraction. The 
Foley & Sammon method is based on the generalized Fisher criterion (Foley 
and Sammon 1975). For a two class problem, the Foley & Sammon method is 
used for comparison. If there are more than 2 classes, Canonical Analysis is 
used for comparison. 

In the following test, two classes are chosen from the FSS data. Table 6.4 
provides information on the classes. Figure 3.17 in Chapter 3 shows the mean 
graph of the two classes. 


Table 6.4 Class description of data collected at Finney Co. KS.

SPECIES          No. of Samples    (column header missing in source)
WINTER WHEAT     691               400
UNKNOWN CROPS    619               400


Figure 6.3 shows a performance comparison. First the original data set is
reduced to a 17-dimensional data set using Uniform Feature Design. With 17
features, the classification accuracy is 95.5%. Discriminant Feature Extraction
achieves 91.2% and 93.7% with one and two features, respectively. Though
Discriminant Feature Extraction shows better performance than Principal
Component Analysis and Uniform Feature Design, Decision Boundary Feature
Extraction and the Foley & Sammon method show the best performance,
achieving nearly the maximum possible classification accuracy with one feature.

In the following test, two classes are chosen from the FSS data. Table 6.5
provides information on the classes. Figure 3.19 in Chapter 3 shows the mean
graph of the two classes. There is relatively little difference in the mean vectors.


Table 6.5 Class description of data collected at Finney Co. KS.

SPECIES         (two column headers illegible in source)
WINTER WHEAT    223    223
SPRING WHEAT    474    474

Figure 6.4 shows a performance comparison. With 25 features, the classification
accuracy is 92.4%. Decision Boundary Feature Extraction and Discriminant
Analysis show similar performance, outperforming other methods.

Figure 6.4 Performance comparison.


In the following test, 4 classes are chosen from the data collected at 
Hand Co. SD. on May 15, 1978. Table 6.6 provides class information. Figure 
3.23 in Chapter 3 shows the mean graph of the 4 classes.

Table 6.6 Class description.

Species             Date            (two column headers illegible in source)
Winter Wheat        May 15, 1978    223    223
Native Grass Pas    May 15, 1978    196    196
Oats                May 15, 1978    163    163
Unknown Crops       May 15, 1978    253    253

Figure 6.5 shows a performance comparison. In this experiment, Decision 
Boundary Feature Extraction and Discriminant Feature Extraction outperform 
other methods. Though Decision Boundary Feature Extraction and Discriminant 
Feature Extraction show similar performance, Decision Boundary Feature 
Extraction shows better performance when 7-10 features are used.

Figure 6.5 Performance comparison.

In the next test, 6 classes are chosen from the FSS data. Table 6.7 provides a
description of the 6 classes. In this test, 300 samples are used for training and
the rest are used for testing.

Table 6.7 Class description of the multi-temporal 6 classes.

Date      Location          Species          No. of Samples
770308    Finney Co. KS.    Winter Wheat     691
770626    Finney Co. KS.    Winter Wheat     677
771018    Hand Co. SD.      Winter Wheat     662
770503    Finney Co. KS.    Winter Wheat     658
770626    Finney Co. KS.    Summer Fallow    643
780726    Hand Co. SD.      Spring Wheat     518

Figure 6.6 Performance comparison.

Figure 6.6 shows a performance comparison. In this example,
Discriminant Analysis shows the best performance when fewer than 3 features
are used. When more than 2 features are used, Decision Boundary Feature
Extraction shows the best performance.


6.5.2 Non-Parametric Classifications 


6.5.2.1 Experiments with Generated Data

The non-parametric classifier was implemented by Parzen density 
estimation using a Gaussian kernel function. 


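Example 6.3 In this example, class $\omega_1$ is normal with the following statistics
(the mean vector $M_1$ is not legible in the source):

$$\Sigma_1 = \begin{bmatrix} 3 & 0.5 \\ 0.5 & 3 \end{bmatrix}$$

And class $\omega_2$ is equally divided between two normal distributions with the
following statistics (the mean vectors $M_2^1$ and $M_2^2$ are not fully legible in
the source):

$$\Sigma_2^1 = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 2 \end{bmatrix} \quad \text{and} \quad \Sigma_2^2 = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 2 \end{bmatrix}$$

200 samples are generated for each class. Eigenvalues and eigenvectors
of $\Sigma_{DFM}$ are calculated as follows:

$$\lambda_1 = 0.74820, \quad \lambda_2 = 0.25180, \quad \phi_1 = \begin{bmatrix} 0.68 \\ -0.74 \end{bmatrix}, \quad \phi_2 = \begin{bmatrix} 0.74 \\ 0.68 \end{bmatrix}$$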


Figure 6.7 shows the distribution of the data and the eigenvectors of
$\Sigma_{DFM}$. Table 6.8 shows the classification accuracies. The proposed
algorithm finds the right feature, achieving the same classification accuracy
as in the original space with one feature.

Figure 6.7 Data distribution of Example 6.3 and the feature vectors found by
the discriminant feature extraction method.


Table 6.8 Classification accuracies of Example 6.3.

No. Features    Classification Accuracy (%)
1               90.3
2               90.3


6.5.2.2 Experiments with Real Data

In the next test, there are 3 classes and each class has 2 subclasses. In 
other words, 2 subclasses were combined to form a new class. By purposely 
combining data from different classes, the data are made to be multi-modal. 
Table 6.9 provides information on the classes. Figure 4.14 in Chapter 4 shows a 
mean value graph of the 6 subclasses and Figure 4.15 in Chapter 4 shows a 
mean value graph of the 3 classes each of which has 2 subclasses. 500 
randomly selected samples from each class are used as training data and the
rest are used for testing.

Table 6.9 Class description.

Class               Subclass                       No. of Samples    Total No. of Samples
Class $\omega_1$    Winter Wheat, March 8, 1977    691               1209
                    Spring Wheat, July 26, 1978    518
Class $\omega_2$    Winter Wheat, June 26, 1977    677               1146
                    Spring Wheat, Sep. 21, 1978    469
Class $\omega_3$    Winter Wheat, Oct. 18, 1977    662               1103
                    Spring Wheat, Oct. 26, 1978    441

Figure 6.8 shows a performance comparison. Discriminant Feature Extraction
and Decision Boundary Feature Extraction show similar performance,
outperforming other methods. It is noted that, in Discriminant Feature Extraction,
the decision boundary need not be found, reducing computing time.

In the next test, there are 3 classes and each class has 2 subclasses. In 
other words, 2 subclasses were combined to form a new class. Table 6.10 
provides information on the classes. 500 randomly selected samples from each
class are used as training data and the rest are used for testing.


Table 6.10 Class description.

Class               Subclass                       No. of Samples    Total No. of Samples
Class $\omega_1$    Winter Wheat, March 8, 1977    691               1145
                    Spring Wheat, July 9, 1978     454
Class $\omega_2$    Winter Wheat, June 26, 1977    677               1195
                    Spring Wheat, July 26, 1978    518
Class $\omega_3$    Winter Wheat, Oct. 18, 1977    662               1103
                    Spring Wheat, Oct. 26, 1978    441


Figure 6.9 shows a performance comparison. The classification accuracy
with 17 features is 96.0%. Discriminant Analysis shows the best performance
when fewer than 3 features are used. However, when more than 2 features are
used, Decision Boundary Feature Extraction and Discriminant Feature Extraction
outperform all other methods. Overall, Discriminant Feature Extraction and
Decision Boundary Feature Extraction show similar performance.

6.6 Conclusion

In this chapter, we considered feature extraction as eliminating features 
which have no impact on the value of the discriminant function. In order to find 
the discriminantly irrelevant features which have no impact on the value of the 
discriminant function, we defined the discriminant feature matrix and showed 
that eigenvectors of the discriminant feature matrix corresponding to zero 
eigenvalues are discriminantly irrelevant features and can be eliminated 
without increasing classification error. Then we proposed a procedure for the 
discriminant feature extraction algorithm.

We compared the discriminant feature extraction method with the
decision boundary feature extraction method in the previous chapters and
showed that the discriminant feature extraction method is a generalized form of
the decision boundary feature extraction method. When the decision boundary
is well defined and can be easily found, as in parametric classifiers, the decision
boundary feature extraction method gives better performance. However, if the
decision boundary is not well defined or is difficult to find, as in some
non-parametric classifiers, the discriminant feature extraction method gives
comparable performance without the need to find the decision boundary, which
is very time-consuming in some non-parametric classifiers.
Furthermore, by generalizing the concept, the technique can be used for
constructing a matrix from vectors which are useful for a given problem, such as
non-parametric regression, density estimation, and feature extraction for other
applications.

Experiments show that the discriminant feature extraction method can be 
used for parametric and non-parametric classifiers, and does not deteriorate 
even if there is no difference in mean vectors or in covariance matrices. 
Although the discriminant feature extraction method was developed for the 
discriminant function which uses a posteriori probabilities, it can be used for any 
discriminant function. 




CHAPTER 7 ANALYZING HIGH DIMENSIONAL DATA 


7.1 Introduction 

Developments with regard to sensors for Earth observation are moving in 
the direction of providing much higher dimensional multispectral imagery than 
is now possible. The HIRIS instrument now under development for the Earth 
Observing System (EOS), for example, will generate image data in 192 spectral 
bands simultaneously (Goetz 1989). MODIS (Ardanuy et al. 1991), AVIRIS 
(Porter et al. 1990) and the proposed HYDICE are additional examples. 
Although conventional analysis techniques primarily developed for relatively 
low dimensional data can be used to analyze high dimensional data, there are 
some problems in analyzing high dimensional data which have not been 
encountered in low dimensional data. In this chapter, we address some of these 
problems. In particular, we investigate (1) the relative potential of first and 
second order statistics in discriminating between classes in high dimensional 
data, (2) the effects of inaccurate estimation of first and second order statistics 
on discriminating between classes, and (3) a visualization method for second 
order statistics of high dimensional data. 


7.2 First and Second Order Statistics in High Dimensional Data 

The importance of the second order statistics in discriminating between 
classes in multispectral data was recognized by Landgrebe (1971). In that 
study, it was found that small uncorrelated noise added to each band caused a 
greater decrease in classification accuracy than larger correlated noise. We 
begin with a test to investigate the role of first and the second order statistics in 
high dimensional data. The test was done using FSS (Field Spectrometer
System) data obtained from a helicopter platform (Biehl et al. 1982). Table 7.1
shows the major parameters of the FSS.


Table 7.1 Parameters of Field Spectrometer System (FSS).

Number of Bands      60
Spectral Coverage    0.4 - 2.4 μm
Altitude             60 m
IFOV (ground)        25 m


In order to evaluate the roles of first and second order statistics in high 
dimensional data, three classifiers were tested. The first classifier is the 
Gaussian Maximum Likelihood (ML) classifier which utilizes both class mean 
and class covariance information. For the second case, the mean vectors of all 
classes were made zero. Thus, the second classifier, which is a Gaussian ML 
classifier, is constrained to use only covariance differences among classes. The 
third classifier is a conventional minimum distance classifier (Richards 1986) 
which utilizes only first order statistics (Euclidean distance). Note that the first 
and third classifiers were applied to the original data set; the second classifier 
was applied to the modified data set where the mean vectors of all classes were
set to zero so that there were no mean differences among classes.
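
The three classifiers can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the experiments: equal prior probabilities are assumed, the data are a synthetic two-class, two-feature stand-in for the FSS data, the helper names are hypothetical, and accuracy is measured on the training samples for brevity.

```python
import numpy as np

def fit_stats(x_by_class):
    """Estimate (mean vector, covariance matrix) for each class."""
    return [(x.mean(axis=0), np.cov(x, rowvar=False)) for x in x_by_class]

def gaussian_ml(x, stats):
    """Gaussian ML labels for the rows of x, assuming equal priors."""
    scores = []
    for mean, cov in stats:
        diff = x - mean
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        scores.append(-0.5 * (np.linalg.slogdet(cov)[1] + maha))
    return np.argmax(scores, axis=0)

def minimum_distance(x, stats):
    """Label each row of x with the class whose mean is nearest (Euclidean)."""
    dists = [np.linalg.norm(x - mean, axis=1) for mean, _ in stats]
    return np.argmin(dists, axis=0)

def accuracy(classify, x_by_class, stats):
    """Fraction of samples assigned to their own class."""
    hits = sum(np.sum(classify(x, stats) == k) for k, x in enumerate(x_by_class))
    return hits / sum(len(x) for x in x_by_class)

# Synthetic stand-in data: a small mean difference, a large covariance difference.
rng = np.random.default_rng(0)
data = [rng.normal(0.0, 1.0, size=(500, 2)),
        rng.normal(0.0, 3.0, size=(500, 2)) + np.array([1.0, 0.0])]
zero_mean = [x - x.mean(axis=0) for x in data]   # remove every class mean

acc_ml = accuracy(gaussian_ml, data, fit_stats(data))                   # mean and covariance
acc_cov_only = accuracy(gaussian_ml, zero_mean, fit_stats(zero_mean))   # covariance only
acc_min_dist = accuracy(minimum_distance, data, fit_stats(data))        # mean only
print(acc_ml, acc_cov_only, acc_min_dist)
```

On data of this kind the covariance-only classifier remains well above the minimum distance classifier, because the classes differ mainly in spread rather than in location.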

To provide data with different numbers of spectral features, a simple 
band combination procedure, referred to as Uniform Feature Design, was used. 
In this procedure, adjacent bands were combined to form the desired number of 
features. For example, if the number of features is to be reduced from 60 to 30, 
each two consecutive bands are combined to form a new feature. Where the 
number of features desired is not evenly divisible into 60, the nearest integer 
number of bands is used. For example, for 9 features, the first 6 original bands 
were combined to create the first feature, then the next 7 bands were combined 
to create the next feature, and so on. 
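
A minimal sketch of this band combination is given below. Two points are assumptions, not statements from the text: the group boundaries are taken as floor(k * 60 / n), which reproduces the 6-then-7 grouping quoted for 9 features, and the bands in a group are averaged (the text only says they are combined). The function name is hypothetical.

```python
import numpy as np

def uniform_feature_design(x, n_features):
    """Combine adjacent bands of x (n_samples x n_bands) into n_features features.

    Assumptions: group edges are floor(k * n_bands / n_features), and the
    bands inside a group are averaged.
    """
    n_bands = x.shape[1]
    edges = (np.arange(n_features + 1) * n_bands) // n_features
    groups = [x[:, lo:hi].mean(axis=1) for lo, hi in zip(edges[:-1], edges[1:])]
    return np.column_stack(groups), np.diff(edges)

x = np.arange(120, dtype=float).reshape(2, 60)   # two dummy 60-band samples
features, sizes = uniform_feature_design(x, 9)
print(sizes)   # number of original bands merged into each new feature
```

For 30 features the same rule merges every two consecutive bands, as in the first example of the text.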

In the following test, 12 classes were selected from FSS data. The 
selected data were multi-temporal. Table 7.2 provides information on the 12 
classes. 100 randomly selected samples were used as training data and the 
rest were used as test data. Figure 7.1 shows the graph of the class mean 
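values of the 12 classes.

Table 7.2 Description of the multi-temporal 12 classes.

Date           Location        Species        No. of Samples
770308         (illegible)     (illegible)    691
770626         (illegible)     (illegible)    677
771018         (illegible)     (illegible)    662
(illegible)    (illegible)     (illegible)    658
770626         (illegible)     (illegible)    643
780726         (illegible)     (illegible)    518
780602         (illegible)     (illegible)    517
780515         (illegible)     (illegible)    474
780921         (illegible)     (illegible)    469
780816         Hand Co. SD.    (illegible)    464
780709         (illegible)     (illegible)    454
781026         (illegible)     (illegible)    441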


Figure 7.1 Class means of the 12 multi-temporal classes (horizontal axis:
spectral band).

The original 60 band data were reduced using Uniform Feature Design
to 1 through 20 feature data and the three classifiers were tested on the
reduced feature sets (1 through 20). Figure 7.2 shows a performance
comparison of the three classifiers. As expected, the Gaussian ML classifier
performs better than the other two classifiers, achieving 94.8% with 20 features.
On the other hand, the minimum distance classifier achieved about 40%
classification accuracy with 20 features. In fact, the performance of the
minimum distance classifier saturated after four features. Meanwhile, the
classification accuracy of the Gaussian ML classifier with zero mean data
increased continuously as more features were used, achieving 73.2% with 20
features. In low dimensionality (fewer than 4 features), the minimum distance
classifier shows better performance than the Gaussian ML classifier with zero
mean data. When more than 3 features are used, the Gaussian ML classifier
with zero mean data shows better performance than the minimum distance
classifier.



Figure 7.2 Performance comparison of the Gaussian ML classifier, the 
Gaussian ML classifier with zero mean data, and the minimum 
distance classifier. 


Figures 7.3 and 7.4 show the performance of the minimum distance classifier
and the Gaussian ML classifier with zero mean data for various numbers of
classes. It is interesting that the performance of the minimum distance classifier
reached saturation with 4-5 features, and after that adding more features did not
make any significant change in classification accuracy. On the other hand, the
performance of the Gaussian ML classifier with zero mean data improves as
more features are used, as can be seen in Figure 7.4.

Figure 7.3 Performance of the minimum distance classifier (horizontal axis:
number of features). Curves: 2 classes, June 26, 1977; 2 classes, June 2, 1978;
4 classes, July 9, 1978; 4 classes, Sep. 21, 1978; 8 classes, multi-temporal data;
12 classes, multi-temporal data; 40 classes, multi-temporal data.

Figure 7.4 Performance of the Gaussian ML classifier with zero mean data.
Curves: 2 classes, June 26, 1977; 2 classes, June 2, 1978; 4 classes, July 9, 1978;
4 classes, Sep. 21, 1978; 8 classes, multi-temporal data; 12 classes,
multi-temporal data; 40 classes, multi-temporal data.


It can be observed from the experiments that the second order statistics
play an important role in high dimensional data. The ineffectiveness of the
minimum distance classifier, which does not use second order statistics, is
particularly noteworthy. Though the Euclidean distance is not as effective a
measure as other distance measures which utilize the second order statistics,
the minimum distance classifier is still widely used in relatively low dimensional
data because of its low computation cost. In particular, in computationally
intensive tasks such as clustering, the Euclidean distance is widely used.

It is noteworthy that, in the low dimensional case, class mean differences play a
more important role in discriminating between classes than the class
covariance differences. However, as the dimensionality increases, the class
covariance differences become more important, especially when adjacent
bands are highly correlated and there are sizable variations in each band of
each class. This suggests that care must be exercised in applying classification
algorithms such as the minimum distance classifier to high dimensional data.

7.3 Minimum Distance Classifier in High Dimensional Space 

7.3.1 Average Class Mean Difference and Average Distance From Mean 

We will next further investigate the performance of the minimum distance
classifier in high dimensional remotely sensed data. In order to analyze
qualitatively the performance of the minimum distance classifier, the Average
Class Mean Difference (ACMD) is defined as follows:

$$\text{ACMD} = \frac{2}{L(L-1)} \sum_{i=2}^{L} \sum_{j=1}^{i-1} |M_i - M_j|$$

where $L$ is the number of classes and $M_i$ is the mean of class $\omega_i$.

Generally, increasing the ACMD should improve the performance of the
minimum distance classifier. Similarly, the Average Distance From Mean
(ADFM) is defined as follows:

$$\text{ADFM} = \frac{1}{N} \sum_{i=1}^{L} \sum_{j=1}^{N_i} |x_j^i - M_i|$$

where $N$ is the total number of samples;
$L$ is the number of classes;
$N_i$ is the number of samples of class $\omega_i$;
$x_j^i$ is the j-th sample of class $\omega_i$;
$M_i$ is the mean of class $\omega_i$.


The ADFM is thus the average distance that samples are located from the 
mean. Generally, decreasing ADFM will improve the performance of the 
minimum distance classifier. Figure 7.5 shows the ACMD and the ADFM of the 
12 classes of Table 7.2. As can be seen, the ACMD increases as more features
are added. However, the ADFM also increases.

Figure 7.5 Graph of the Average Class Mean Difference and the Average
Distance From Mean of the 12 classes of Table 7.2.

Figure 7.6 Ratio of the Average Class Mean Difference and the Average
Distance From Mean of the 12 classes of Table 7.2.


Figure 7.6 shows the ratio of the ACMD and the ADFM. Note that the ratio
increases up to 3 features and saturates thereafter. Though one should
expect variations in this effect from problem to problem, the implication is that
the performance of classifiers which mainly utilize class mean differences may
not improve much in high dimensional data, especially when correlation
between adjacent bands is high.
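
The two measures and their ratio can be computed with a short sketch such as the one below. It assumes that the distance in both definitions is the Euclidean norm and that the ADFM is normalized by the total number of samples N; the function names and the tiny data set are hypothetical.

```python
import numpy as np
from itertools import combinations

def acmd(x_by_class):
    """Average Euclidean distance between all pairs of class means."""
    means = [x.mean(axis=0) for x in x_by_class]
    pairs = list(combinations(means, 2))          # L(L-1)/2 unordered pairs
    return sum(np.linalg.norm(a - b) for a, b in pairs) / len(pairs)

def adfm(x_by_class):
    """Average Euclidean distance of every sample from its own class mean."""
    total = sum(np.linalg.norm(x - x.mean(axis=0), axis=1).sum() for x in x_by_class)
    return total / sum(len(x) for x in x_by_class)

# Two tiny hypothetical classes with two features each.
classes = [np.array([[0.0, 0.0], [2.0, 0.0]]),
           np.array([[4.0, 0.0], [4.0, 2.0]])]
ratio = acmd(classes) / adfm(classes)
print(acmd(classes), adfm(classes), ratio)
```

Averaging over the L(L-1)/2 unordered pairs is the same as the factor 2 / (L(L-1)) in front of the double sum.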


7.3.2 Eigenvalues of Covariance Matrix of High Dimensional Data 


In high dimensional remotely sensed data, there is frequently very high
correlation between adjacent bands, and most data are distributed along a few
major components. Table 7.3 shows the eigenvalues (ordered by size) of the
covariance matrix estimated from 643 samples of Summer Fallow collected at
Finney County, Kansas on July 26, 1977, as well as proportions and
accumulations of the eigenvalues. Figure 7.7 shows the magnitude of the
eigenvalues on a log scale. As can be seen, there are very large differences
among eigenvalues. The ratio between the largest eigenvalue and the smallest
is on the order of $10^6$. A few eigenvalues are dominant and the rest are very
small in value.
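
The quantities tabulated in Table 7.3 (eigenvalues, proportions and accumulations) can be computed as in the following sketch. The data here are synthetic and only mimic highly correlated adjacent bands; they are not the Summer Fallow samples, and the function name is hypothetical.

```python
import numpy as np

def eigen_spectrum(x):
    """Eigenvalues of the sample covariance (descending), with the proportion
    and the accumulation of each eigenvalue in percent."""
    cov = np.cov(x, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]       # eigvalsh is ascending
    proportion = 100.0 * eigvals / eigvals.sum()
    accumulation = np.cumsum(proportion)
    return eigvals, proportion, accumulation

# Synthetic stand-in: 10 bands that share one strong common component.
rng = np.random.default_rng(0)
common = rng.normal(size=(643, 1))
x = 10.0 * common + 0.1 * rng.normal(size=(643, 10))

eigvals, proportion, accumulation = eigen_spectrum(x)
print(eigvals[0] / eigvals[-1])   # ratio of largest to smallest eigenvalue
print(accumulation[:3])           # accumulated percentage for the first three
```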

It can be seen from Table 7.3 that the largest 3 eigenvalues account for more
than 95% of the total mean square value. The largest 8 eigenvalues account for
more than 99%. Most variation of the data occurs along a few eigenvectors
corresponding to the largest eigenvalues and there is very little variation along
the other eigenvectors. This indicates that, assuming a Gaussian distribution, the
data will be distributed in the shape of an elongated hyperellipsoid with its
origin at the mean of the data and whose semi-axes are in the directions of the
eigenvectors of the covariance matrix of the data, with lengths proportional to the
square roots of the corresponding eigenvalues. Since the eigenvalues differ by
several orders of magnitude, there are very large differences among the lengths
of the semi-axes.

Table 7.3 Eigenvalues of covariance of high dimensional remotely sensed
data.

Without utilizing the second order statistics, a classifier such as the
minimum distance classifier assumes that data are distributed in the shape of a
hypersphere instead of a hyperellipsoid. As a result, the minimum distance
classifier defines a very ineffective decision boundary, particularly in high
dimensional data. Figure 7.8 shows an example in two dimensional space. The
two classes in Figure 7.8 are, in fact, quite separable by using second order
statistics which give information about the shape of the distribution, and in
particular, the major component along which most data are distributed.
However, the minimum distance classifier, using only the first order statistics,
defines a very unsatisfactory decision boundary, causing avoidable errors. This
phenomenon becomes more severe if data are distributed along a few major
components. On the other hand, if classes are distributed in the shape of a
hypersphere, the minimum distance classifier will give better performance.



Figure 7.8 Classification error of the minimum distance classifier.

7.3.3 Determinant of Covariance Matrix of High Dimensional Data

The determinant is equal to the product of the eigenvalues, i.e.,

$$\text{DET} = \prod_{i=1}^{N} \lambda_i$$

As can be seen in Table 7.3, most of the eigenvalues of the covariance matrix of
high dimensional data are very small in value. Therefore, determinants of high
dimensional remotely sensed data will have very small values. Figure 7.9
shows the magnitudes of the determinants of the 12 classes for various numbers
of features. In low dimensionality, the differences of determinants among
classes are relatively small. As the dimensionality increases, the determinants
decrease exponentially, indicating that the data are distributed in a highly
elongated shape. In addition, there are significant differences between classes,
indicating that there are significant differences in the actual volumes in which
the classes are distributed.
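
Because the determinant is a product of many small eigenvalues, it is convenient to work with its logarithm. The sketch below sums the log10 eigenvalues; it assumes a symmetric positive definite covariance matrix, and the geometric spectrum used in the loop is hypothetical, chosen only to show the exponential decrease with the number of features.

```python
import numpy as np

def log10_det(cov):
    """log10 of det(cov), computed as the sum of the log10 eigenvalues.

    Assumes cov is symmetric positive definite, so every eigenvalue is positive.
    """
    return float(np.sum(np.log10(np.linalg.eigvalsh(cov))))

# Hypothetical spectrum: each eigenvalue is 10 times smaller than the previous one.
for n in (1, 5, 10, 20):
    cov = np.diag(10.0 ** -np.arange(n))
    print(n, log10_det(cov))
```

`np.linalg.slogdet` returns the natural-log determinant directly and can be used as a cross-check.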



Figure 7.9 Determinant of the 12 classes (curves for N = 1, 2, 3, 4, 5, 7, 10,
12, 14, 16, 18 and 20 features).

7.4 Diagonalizing and The Number of Training Samples

7.4.1 Diagonalizing the Data

The limited performance of the minimum distance classifier in the 
previous section is mainly due to the fact that there is very high correlation 
between adjacent bands in high dimensional remotely sensed data. As a result, 
it is difficult to evaluate the roles of class mean differences and class covariance 
differences in discriminating between classes in high dimensional data. To 
better compare the roles of class mean differences (first order statistics) and 
class covariance differences (second order statistics), the entire data set is 
diagonalized (Fukunaga 1990), i.e., a linear transformation is applied to the 
data such that the transformed data will have a unit covariance matrix. Let

$$Y = \Lambda^{-1/2} \Phi^T X$$

where $\Phi$ is a matrix whose column vectors are the eigenvectors of $\Sigma_X$,
the covariance matrix of the original data, and $\Lambda$ is a diagonal matrix
whose diagonal elements are the eigenvalues of $\Sigma_X$.

Then the covariance matrix of the transformed data $Y$, $\Sigma_Y$, will be an
identity matrix, i.e.,

$$\Sigma_Y = I$$

It will be seen that this linear transformation affects only the performance of the
minimum distance classifier. The performance of the Gaussian ML classifier is
invariant under any linear transformation² since

$$(X - M_X)^T \Sigma_X^{-1} (X - M_X)$$

where $M_X$ is the mean vector of $X$ and $\Sigma_X$ is the covariance matrix of $X$,
is invariant under any linear transformation if the transformation matrix is
non-singular (Fukunaga 1990). After diagonalizing, it is expected that the
performance of the minimum distance classifier will be improved since the
diagonalization process makes the data distribution closer to the shape of a
hypersphere (Figure 7.8).

² Note that this implies that any preprocessing procedure, e.g. calibration, which
is a linear transformation of the data will not affect classification accuracy for a
Gaussian ML classifier.
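
The diagonalizing transformation and the invariance of the quadratic form can be checked numerically with a sketch like the following. It assumes a non-singular sample covariance matrix; the data are synthetic and the helper names are hypothetical.

```python
import numpy as np

def diagonalize(x):
    """Whiten the rows of x with Y = Lambda^{-1/2} Phi^T X.

    Assumes the sample covariance is non-singular (all eigenvalues > 0).
    """
    cov = np.cov(x, rowvar=False)
    eigvals, phi = np.linalg.eigh(cov)            # columns of phi are eigenvectors
    a = np.diag(eigvals ** -0.5) @ phi.T          # A = Lambda^{-1/2} Phi^T
    return x @ a.T, a

def mahalanobis_sq(x):
    """Squared Mahalanobis distance of each row of x from the sample mean."""
    diff = x - x.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(x, rowvar=False))
    return np.einsum("ij,ij->i", diff @ inv_cov, diff)

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0], [1.5, 0.5, 0.0], [1.0, 0.3, 0.2]])
x = rng.normal(size=(200, 3)) @ mixing.T          # strongly correlated features
y, a = diagonalize(x)

cov_y = np.cov(y, rowvar=False)                   # identity up to rounding
same_distance = np.allclose(mahalanobis_sq(x), mahalanobis_sq(y))
print(np.round(cov_y, 6), same_distance)
```

After the transformation the Euclidean distance to the mean in the Y space equals the Mahalanobis distance in the X space, which is why only the minimum distance classifier is affected.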



Figure 7.10 Performance comparison (100 training samples). Curves: Gaussian
ML (Mean & Cov); Gaussian ML (Cov only); Minimum Distance Classifier;
Minimum Distance Classifier (after diagonalizing).


Figure 7.10 shows the classification accuracy vs. the number of features
after diagonalization. There are 40 multi-temporal classes. 100 randomly
selected samples are used as training data and the rest are used for testing. As
expected, the Gaussian ML classifier shows the best performance and the peak
accuracy of the Gaussian ML classifier occurs when the number of features is
31, achieving 82.8%. When more than 31 features are used, the performance of
the Gaussian ML classifier begins to decrease slightly, indicating that the Hughes
phenomenon is occurring (Hughes 1968). The Gaussian ML classifier applied
to the zero-mean data also shows peak performance with 31 features,
achieving 62.4% classification accuracy. When more than 31 features are used,
the Gaussian ML classifier applied to the zero-mean data also shows the
Hughes phenomenon.

The minimum distance classifier applied to the original data shows very
limited performance, achieving just 26.6% classification accuracy with 40
features. In fact, its performance saturates after 4 features. After
diagonalization, the performance of the minimum distance classifier is
greatly improved, achieving 64.8% classification accuracy with 36 features.
It appears that, when the data are diagonalized, class mean differences are
more important than class covariance differences in discriminating between
classes in this example. However, the difference in classification accuracy
decreases as dimensionality increases. For example, when 4 features are
used, the classification accuracy of the minimum distance classifier applied
to the diagonalized data is 35.3% while that of the Gaussian ML classifier
applied to the zero-mean data is 19.8%, a difference of 15.5 percentage
points. When 31 features are used, the difference is just 1.3 percentage
points. It is interesting that the Hughes phenomenon occurs later for the
minimum distance classifier than for the Gaussian ML classifier. A possible
reason is that the minimum distance classifier uses far fewer parameters
than the Gaussian ML classifier.
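
The gap in parameter counts mentioned above can be made concrete with a small sketch. It assumes that the minimum distance classifier estimates only a mean vector per class, that the covariance-only Gaussian ML classifier estimates only a symmetric covariance matrix per class, and that the full Gaussian ML classifier estimates both. The function name is illustrative and does not come from the report.

```python
def parameters_per_class(n_features: int) -> dict:
    """Count the parameters estimated per class by each classifier type."""
    mean_params = n_features                           # one mean per feature
    cov_params = n_features * (n_features + 1) // 2    # symmetric covariance
    return {
        "minimum_distance": mean_params,
        "gaussian_ml_cov_only": cov_params,
        "gaussian_ml": mean_params + cov_params,
    }


for n in (4, 31, 40):
    print(n, parameters_per_class(n))
```

With 40 features the covariance matrix alone contributes 820 parameters per class, compared with 40 for the mean vector, so a fixed training set is spread much more thinly when second order statistics are estimated.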

7.4.2 Estimation of Parameters and Number of Training Samples 

In supervised classification, parameters are estimated from training data.
When the parameter estimation is not accurate, the performance of the classifier
is affected. In particular, when the number of training samples is limited,
adding more features does not necessarily improve the classification accuracy.
In this section, we illustrate how inaccurate estimation of parameters affects
the performance of the minimum distance classifier and the Gaussian ML
classifier applied to the zero-mean data.

Generally, the classification error is a function of two sets of data, the
training data and the test data, and can be expressed as (Fukunaga 1990)

$$\varepsilon(\Theta_{train}, \Theta_{test})$$

where $\Theta_{train}$ and $\Theta_{test}$ are the sets of parameters of the
training and test data.

In Fukunaga (1990), it is shown that the Bayes error, $\varepsilon(\Theta,\Theta)$,
is bounded by two sample-based estimates, i.e.,

$$E\{\varepsilon(\hat\Theta,\hat\Theta)\} \le \varepsilon(\Theta,\Theta) \le E_{\hat\Theta_{test}}\{\varepsilon(\hat\Theta,\hat\Theta_{test})\} \qquad (7.1)$$

The term $\varepsilon(\hat\Theta,\hat\Theta_{test})$ is obtained by generating two
independent sample sets, $\hat\Theta$ and $\hat\Theta_{test}$, and using
$\hat\Theta$ for training and $\hat\Theta_{test}$ for testing.
$\varepsilon(\hat\Theta,\hat\Theta)$ is obtained by using the same data for
training and test.

In the following test, the 3 classifiers are tested on the 40-class problem 
(Table 3.14). The average number of samples of the 40 classes is about 300. To 
obtain a lower bound of the Bayes error, all data are used for training and test 
(resubstitution method) (Fukunaga 1990). The leave-one-out method 
(Fukunaga 1990) is also used to obtain an upper bound of the Bayes error. 
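
The two estimates can be sketched as follows for a Gaussian ML classifier. This is a minimal illustration on synthetic data, not the code used for the experiments in this chapter; the function names, the three synthetic classes, and the sample sizes are all assumptions made for the example.

```python
import numpy as np


def fit_gaussian(samples):
    """Estimate the mean vector and covariance matrix of one class."""
    return samples.mean(axis=0), np.cov(samples, rowvar=False)


def log_likelihood(x, mean, cov):
    """Gaussian log-likelihood of x, up to an additive constant."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + logdet)


def classify(x, params):
    """Return the index of the class with the largest likelihood."""
    return int(np.argmax([log_likelihood(x, m, c) for m, c in params]))


def resubstitution_accuracy(classes):
    """Train and test on the same samples."""
    params = [fit_gaussian(s) for s in classes]
    hits = sum(classify(x, params) == k
               for k, s in enumerate(classes) for x in s)
    return hits / sum(len(s) for s in classes)


def leave_one_out_accuracy(classes):
    """Re-estimate the statistics of a sample's own class without that sample."""
    params = [fit_gaussian(s) for s in classes]
    hits = 0
    for k, s in enumerate(classes):
        for i in range(len(s)):
            held_out = list(params)
            held_out[k] = fit_gaussian(np.delete(s, i, axis=0))
            hits += classify(s[i], held_out) == k
    return hits / sum(len(s) for s in classes)


# Synthetic stand-in data: 3 classes, 30 samples each, 5 features.
rng = np.random.default_rng(0)
classes = [rng.normal(loc=shift, size=(30, 5)) for shift in (0.0, 1.0, 2.0)]
resub = resubstitution_accuracy(classes)
loo = leave_one_out_accuracy(classes)
print(f"resubstitution: {resub:.3f}  leave-one-out: {loo:.3f}")
```

The resubstitution accuracy gives the optimistic estimate (a lower bound on the error) and the leave-one-out accuracy gives the pessimistic one; the gap between them typically widens as the number of samples per class shrinks relative to the number of features.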

Figure 7.11 shows the performance comparison of the resubstitution method
and the leave-one-out method. Compare Figure 7.11 with Figure 7.10, where
100 randomly chosen samples are used for training. When 40 features are
used, the classification accuracy of the Gaussian ML classifier improves
from 81.3% (100 training samples) to 93.8% (all data used for training).
The improvement of the Gaussian ML classifier applied to the zero-mean data
is particularly noteworthy: its classification accuracy increases from
60.5% (100 training samples) to 86.1% (all data used for training) with 40
features. When 100 training samples are used, the difference between the
classification accuracies of the Gaussian ML classifier applied to the
original data and the Gaussian ML classifier applied to the zero-mean data
is 20.8% with 40 features. When all samples are used for training, the
difference is reduced to 7.7%. On the other hand, the performance of the
minimum distance classifier improves only slightly. The classification
accuracy of the minimum distance classifier applied to the original data
increases from 26.6% (100 training samples) to 27.5% (all data used for
training) with 40 features, and the classification accuracy of the minimum
distance classifier applied to the diagonalized data increases from 64.2%
(100 training samples) to 67.3% (all data used for training).

In the leave-one-out method, the accuracy improvements are smaller. With 40
features, the classification accuracy is about 85.9% for the Gaussian ML
classifier, 71.9% for the Gaussian ML classifier with zero-mean data, and
66.1% for the minimum distance classifier.

Figure 7.12 Performance comparison of the minimum distance classifier
applied to the diagonalized data and the Gaussian ML
classifier with the zero-mean data for various numbers of
training samples (horizontal axis: number of features).


Figure 7.12 shows the classification accuracy vs. number of features
when various numbers of training samples are used. Note that the performance
of the Gaussian ML classifier with the zero-mean data improves greatly when
all data are used for training or the leave-one-out method is used, while the
performance of the minimum distance classifier improves only slightly. In
both cases, the Gaussian ML classifier applied to the zero-mean data
outperforms the minimum distance classifier in high dimensionality. Since the
Bayes error is bounded by the two sample-based estimates (equation 7.1), it
appears that the second order statistics play an increased role in
discriminating between classes in high dimensionality.

In Figure 7.12, the difference between the resubstitution method and the
leave-one-out method is large, resulting in a loose bound on the Bayes error.
One reason for the large difference is that some of the classes have a
relatively small number of samples. To overcome this problem, we generated
data from the statistics estimated from the classes: 1000 samples were
generated for each class. The resubstitution method and the leave-one-out
method were then applied to obtain a lower and an upper bound. Figure 7.13
shows the result. The classification accuracies of the Gaussian ML
classifier, the Gaussian ML classifier applied to zero-mean data, and the
minimum distance classifier are 99.5%, 97.5%, and 56.9%, respectively, when
all data are used for training and test, and 99.2%, 96.0%, and 54.3%,
respectively, when the leave-one-out method is used. It is interesting that
the Gaussian ML classifier applied to zero-mean data shows almost the same
performance as the Gaussian ML classifier applied to the original data in
high dimensionality, significantly outperforming the minimum distance
classifier.

Figure 7.13 Performance comparison of generated data when all data are used
for training and test, and when the leave-one-out method (L-method) is used.
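
Generating data from estimated class statistics, as done for Figure 7.13, might look like the sketch below. It assumes each class is modeled as a single multivariate Gaussian whose mean and covariance are estimated from the available samples. The function name, the seed, and the stand-in training data are illustrative only.

```python
import numpy as np


def generate_from_statistics(samples, n_generated=1000, seed=0):
    """Draw Gaussian samples using the mean and covariance estimated from `samples`."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_generated)


# Stand-in for one class: 50 samples with 3 features.
training = np.random.default_rng(1).normal(size=(50, 3))
generated = generate_from_statistics(training)
print(generated.shape)
```

Because the generated samples follow the assumed Gaussian model exactly, the resulting bounds describe the model fitted to each class rather than the original field data.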


In practice, estimation of the second order statistics of high dimensional
data is a difficult problem, particularly when the number of training samples
is limited. However, these results suggest that second order statistics
provide great potential for discriminating between classes in high
dimensionality if they can be accurately estimated. In many feature
extraction algorithms, the lumped covariance is used [(Fukunaga 1990) and
(Foley and Sammon 1975)]. However, the above results indicate that covariance
differences among classes also provide important information for
discriminating between classes in high dimensional data. Recently, the
possibility of obtaining a better estimation of parameters using a large
number of unlabeled samples in addition to training samples has been shown,
and this should be particularly relevant in the case of high dimensional data
(Shahshahani and Landgrebe 1992).


7.5 Visualization of High Dimensional Data 

As the dimensionality of data increases, it becomes more difficult to 
compare class statistics, and in particular, the second order statistics. For 
instance, it would not be feasible to print out mean vectors and covariance 
matrices of 200 dimensional data and compare them manually. Table 7.4 
shows an example of a 20 dimensional correlation matrix. It is very difficult to 
manually perceive much from the numerical values; some type of visualization 
aid seems called for. 


Table 7.4 Correlation Matrix of 20 dimensional data (lower triangle; only
the leading entries of each row are legible).

1.00
0.95 1.00
0.97
0.96
0.95
0.91 0.94
0.90 0.93
0.89 0.92
0.88 0.91
0.86 0.89
0.85 0.88
0.83 0.86
0.82 0.85
0.81 0.84
0.79 0.82
0.77 0.80
0.76 0.78
0.76 0.78
0.75 0.77
0.74 0.75

Kim and Swain proposed a method to visualize the magnitude of
correlation using gray levels (Kim and Swain 1990). We further elaborate on
this method and propose a visualization method of mean vectors and
covariance matrices, along with standard deviations, using a color coding
scheme and a graph. We will call this visualization of statistics the
statistics image. Figure 7.14 shows the format of the statistics image. A
statistics image consists of a color-coded correlation matrix, a mean graph
with standard deviation, and a color code. Figure 7.15 shows the palette
design for the color code. Figure 7.16 shows the actual look of the color
code for the correlation matrix in gray scale. The color changes continuously
from blue to red, with blue indicating a correlation coefficient of -1 and
red indicating a correlation coefficient of 1. In the mean graph part, the
mean vector is displayed plus or minus one standard deviation. At the bottom
of the statistics image, the color code is added for easy comparison.
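
The color coding of a correlation matrix can be sketched as follows. The text fixes only the endpoints (blue at -1, red at 1) and a continuous change between them. The piecewise-linear path through green used here is an assumption suggested by the Blue, Green, and Red labels of the palette figure, and the function names are illustrative.

```python
import numpy as np


def correlation_matrix(cov):
    """Convert a covariance matrix into a correlation matrix."""
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)


def correlation_to_rgb(r):
    """Map r in [-1, 1] to (R, G, B) in [0, 1].

    Assumed palette: blue at -1, green at 0, red at +1, linear in between.
    """
    r = float(np.clip(r, -1.0, 1.0))
    if r < 0:
        return (0.0, 1.0 + r, -r)
    return (r, 1.0 - r, 0.0)


cov = np.array([[4.0, 2.0], [2.0, 9.0]])
corr = correlation_matrix(cov)
image = np.array([[correlation_to_rgb(r) for r in row] for row in corr])
print(image.shape)  # rows x columns x RGB
```

The resulting array can be passed to any image display routine. The mean graph with plus or minus one standard deviation would be drawn separately from the mean vector and the square roots of the diagonal of the covariance matrix.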


[Figure 7.14 panels: color-coded correlation matrix; mean graph with
standard deviation; color code scheme]

Figure 7.14 Format of the statistics image.



[Figure 7.15 labels: Blue; Green; Red]

Figure 7.15 Palette design.


Figure 7.17 shows, in gray scale, the statistics images of Spring Wheat,
Oats, Summer Fallow, and Native Grass Pasture collected on July 26, 1978.
The green lines in the images indicate water absorption bands. At a glance,
one can subjectively perceive how the bands are correlated and easily compare
the statistics of the different classes. It is easy to see that there are
significant differences in the class correlations, suggesting probable
separability via a classifier. Figure 7.18 shows, in gray scale, the
statistics images of Spring Wheat collected on May 15, 1978, June 2, 1978,
July 26, 1978, and August 16, 1978. The statistics images clearly show how
the statistics of the Spring Wheat changed over the period. The statistics
image thus provides a valuable means of visualizing the statistics of high
dimensional data.

[Figure 7.16 axis: Correlation Coefficient, from -1 to 1]

Figure 7.16 The actual look of the color code (gray scale).

Figure 7.17 Statistics images of spring wheat, oats, summer fallow, and
native grass pasture on July 26, 1978 (gray scale).

Figure 7.18 Statistics images of spring wheat over a 4-month period (gray scale).


7.6 Conclusion 

Advancement in sensor technology will provide data in much higher 
dimensions than previous sensors. Although such high dimensional data will 
present a substantial potential for deriving greater amounts of information, some 
new problems arise that have not been encountered in relatively low 
dimensional data. In this chapter, we examined the possible roles of first and 
second order statistics in discriminating between classes in high dimensional 
space. It is observed that a conventional minimum distance classifier which 
utilizes only the first order statistics failed to fully exploit the discriminating 
power of high dimensional data. By investigating the characteristics of high 
dimensional remotely sensed data, we demonstrated the reason for this limited 
performance. We also investigated how the degree of accuracy in estimating 
parameters affects the performance of classifiers and especially the potential of 
second order statistics in discriminating among classes in high dimensional 
data. 


Given the importance of second order statistics in high dimensional data,
there is a clear need to represent them better. For that purpose, we
proposed a visualization method of the first and the second order statistics
using a color coding scheme. By displaying the first and the second order
statistics using this scheme, one can more easily compare spectral classes
and visualize information about the statistics of the classes.


CHAPTER 8 SUMMARY AND SUGGESTIONS FOR FURTHER WORK


8.1 Summary 

In this research, three main subjects are studied: (1) fast likelihood
classification; (2) a new feature extraction algorithm; (3) characteristics of
high dimensional data and problems in analyzing high dimensional data.

In Chapter 2, a fast likelihood classification was proposed to reduce the 
processing time of high dimensional data. As the dimensionality and the 
number of classes grow, the computation time becomes an important factor. 
Based upon the recognition that only a small number of classes are close to 
each other even when there are a large number of classes, a multistage 
classification was proposed. In the early stages where a fraction of the total 
features are used, classes whose likelihood values are smaller than a threshold 
are truncated, i.e., eliminated from further consideration so that the number of 
classes for which likelihood values are to be calculated at the following stages 
is reduced. It was shown that the computing time can be reduced by a factor of 3
to 7 using the proposed multistage classification while maintaining essentially
the same accuracies when the Gaussian ML classifier is used. This method will
make it possible to extract detailed information from high dimensional data
without increasing the processing time significantly.
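
The multistage idea summarized above can be sketched as follows. The truncation criteria themselves are developed in Chapter 2 and are not reproduced here. The relative log-likelihood threshold used below, the choice of leading features for each stage, and all function names are assumptions made for illustration.

```python
import numpy as np


def log_likelihood(x, mean, cov):
    """Gaussian log-likelihood of x, up to an additive constant."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + logdet)


def multistage_classify(x, means, covs, stage_sizes, threshold):
    """Classify x, discarding unlikely classes at each early stage.

    stage_sizes: increasing numbers of leading features used per stage.
    threshold:   a class is dropped when its log-likelihood falls more than
                 this far below the best one (hypothetical criterion).
    """
    candidates = list(range(len(means)))
    for n in stage_sizes[:-1]:
        scores = {k: log_likelihood(x[:n], means[k][:n], covs[k][:n, :n])
                  for k in candidates}
        best = max(scores.values())
        candidates = [k for k in candidates if scores[k] >= best - threshold]
    n = stage_sizes[-1]
    return max(candidates,
               key=lambda k: log_likelihood(x[:n], means[k][:n], covs[k][:n, :n]))


# Three well separated classes in 4 dimensions (synthetic values).
means = [np.zeros(4), np.full(4, 3.0), np.full(4, 6.0)]
covs = [np.eye(4)] * 3
label = multistage_classify(np.full(4, 3.0), means, covs,
                            stage_sizes=(2, 4), threshold=5.0)
print(label)
```

Only the classes that survive the first stage are evaluated with the full feature set, which is where the saving in computation comes from; a tighter threshold saves more time at the cost of a higher risk of truncating the correct class.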

In Chapter 3, a new feature extraction algorithm was proposed which
better utilizes the potential of high dimensional data. The method is directly
based on the decision boundary. It was shown that all the necessary features
for classification can be extracted from the decision boundary. The proposed
decision boundary feature extraction algorithm has several desirable
properties: (1) it does not deteriorate when there is little or no difference
in mean vectors or when there is little or no difference in covariance
matrices; (2) it predicts the minimum number of features necessary to achieve
the same classification accuracy as in the original space; (3) it can be used
for both parametric and non-parametric classifiers. In Chapter 3, the decision
boundary feature extraction algorithm was applied to parametric classifiers.
It was shown that the performance of the decision boundary feature extraction
method compares favorably with that of the conventional methods.

In Chapter 4, the decision boundary feature extraction algorithm was 
adapted to non-parametric classifiers. Since non-parametric classifiers do not 
define decision boundaries in analytic form, decision boundaries must be found 
numerically. In Chapter 5, the decision boundary feature extraction algorithm 
was applied to neural networks. First, a feature extraction method for neural 
networks using the Parzen density estimator was proposed. To apply the 
decision boundary feature extraction method directly to neural networks, we 
defined the decision boundary in neural networks. From the decision boundary, 
a new feature set is calculated. Experiments showed that the decision boundary 
feature extraction method works well with neural networks. 

In Chapter 6, the discriminant feature extraction method, which is a 
generalization of the decision boundary feature extraction method, was 
proposed. Comparisons between the decision boundary feature extraction 
method and the discriminant feature extraction method were made. 

In Chapter 7, some problems in analyzing high dimensional data were
investigated. In particular, the increased importance of the second order
statistics was studied. We also investigated how inaccurate estimation of
first order and second order statistics affects the performance of classifiers
in discriminating between classes in high dimensionality. To help human
interpretation and perception of the second order statistics of high
dimensional data, a visualization method of the second order statistics using
a color code and a graph was proposed.


8.2 Suggestions for Further Work

The high dimensional multispectral imagery that future sensors are 
projected to generate will provide a great potential for analyzing the Earth 
resources. For example, the HIRIS instrument will generate image data in 192 
spectral bands. In processing such high dimensional data, there will be many 
challenges to be overcome. It will be almost infeasible to use all 192 bands in 
analysis. First of all, estimation of statistics of such high dimensional data will be 
a very difficult problem, particularly when the number of training samples is 
limited. As a result, using all 192 bands for analysis will not necessarily produce 
an improved result. Figure 8.1 shows an example. There are 6 classes and 
Table 8.1 provides information about the classes. 


Table 8.1 Class description of the multi-temporal 6 classes.

Date             Location          Species          No. of Samples
May 3, 1977      [illegible]       [illegible]      658
May 3, 1977      [illegible]       Unknown Crops    682
March 8, 1977    [illegible]       [illegible]      691
March 8, 1977    [illegible]       [illegible]      619
June 26, 1977    [illegible]       [illegible]      677
June 26, 1977    Finney CO. KS.    Summer Fallow    643


Let 100 randomly selected samples be used for training and the rest for
testing. As can be seen from Figure 8.1, the peak of the classification
accuracy occurs when 29 features are used. When more than 29 features are
used, the classification accuracy actually begins to decrease.



Figure 8.1 Classification accuracy vs. number of features. 


In addition, if feature selection/extraction is done based on the estimated 
statistics of such high dimensional data, the resulting feature set may not be 
reliable. Figure 8.2 shows an example. There are 6 classes and Table 8.1 
provides information on the classes. Again let 100 randomly selected samples 
be used for training and the rest for test. The decision boundary feature 
extraction method was applied to 29 dimensional data and 50 dimensional 
data. As can be seen, the classification accuracy with 29 dimensional data is 
better than that with 50 dimensional data. The result indicates that when the 
number of training samples is limited, using more features results in a poorer 
estimation of statistics which, in turn, decreases the performance of the feature 
extraction method which uses the estimated statistics.

[Figure 8.2 legend: Uniform Feature Design; Decision Boundary Feature
Extraction applied to 29-dimensional data; Decision Boundary Feature
Extraction applied to 50-dimensional data. Horizontal axis: Number of Features.]

Figure 8.2 Feature extraction and number of features.

Thus, it is desirable to reduce the original dimensionality using some
kind of preprocessing technique such that the estimation of statistics in the
reduced dimensionality can be reliable. Feature selection/extraction methods
can then be applied at the reduced dimensionality, further reducing the
dimensionality. Finally, classification/analysis techniques can be applied
to the new data set produced by the feature selection/extraction method.
Figure 8.3 illustrates such a processing scheme for high dimensional data.
With the FSS data, which has 60 spectral bands, it was observed that a
dimensionality of about 20-30 gave the peak performance for preprocessing. In
most cases, 10-20 features gave about the maximum classification accuracy.
However, these numbers can differ depending on the original dimensionality,
the complexity of the problem, the number of available training samples, the
quality of the estimation of statistics, etc. Analytically determining the
dimensionality for preprocessing and for classification/analysis is one of the
important topics in analyzing high dimensional data.



Figure 8.3 Pre-processing of high dimensional data.

The preprocessing techniques must be neither too complex nor too
dependent on the estimated statistics. Otherwise, the Hughes phenomenon may
still occur. In this research, a band combination procedure (Uniform Feature
Design) has been used as the preprocessing technique. Although this procedure
has given acceptable and reliable results, the method is not optimum. Another
possible way is to base the preprocessing on the estimated statistics of the
whole data set [(Wiersma and Landgrebe 1980) and (Chen and Landgrebe 1989)].
Using the whole data set, it is expected that the estimation of parameters
will be more accurate.
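
A band combination step of this kind might be sketched as below. The exact definition of Uniform Feature Design is given elsewhere in this report. The version here simply averages contiguous, roughly equal-width groups of adjacent bands, which is an assumption, and the function name is illustrative.

```python
import numpy as np


def uniform_band_combination(data, n_groups):
    """Average contiguous groups of bands.

    data:     array of shape (n_samples, n_bands)
    n_groups: number of output features
    """
    groups = np.array_split(np.arange(data.shape[1]), n_groups)
    return np.column_stack([data[:, g].mean(axis=1) for g in groups])


# Two samples with 6 bands reduced to 3 features.
data = np.arange(12.0).reshape(2, 6)
reduced = uniform_band_combination(data, 3)
print(reduced)
```

Such a step uses no class statistics at all, so it cannot itself suffer from poorly estimated parameters, which is the property the paragraph above asks of a preprocessing technique.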

More research on preprocessing techniques should increase the benefits
of high dimensional data and improve the performance of classifiers and
analyzers.


Appendix A

Normal Vector to Decision Boundary 


In order to find the decision boundary feature matrix of a pattern 
classification problem, one must be able to find a vector normal to the decision 
boundary at a point on the decision boundary. Under some conditions which
are met in most pattern classification problems, one can find a vector normal to
the decision boundary at a point on the decision boundary using the following
theorem.

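Theorem A.1 If $\nabla h \ne 0$ at $X_0$ and it is continuous in the
neighborhood of $X_0$, then the vector normal to the decision boundary at
$X_0$ is given by (Faux and Pratt 1981)

$$\nabla h(X)\big|_{X=X_0}$$

If the Gaussian ML classifier is used assuming a Gaussian distribution for
each class, $h(X)$ is given by

$$
\begin{aligned}
h(X) &= -\ln\frac{P(X|\omega_1)}{P(X|\omega_2)} = -\ln P(X|\omega_1) + \ln P(X|\omega_2) \\
&= \tfrac{1}{2}(X-M_1)^t\Sigma_1^{-1}(X-M_1) + \tfrac{1}{2}\ln|\Sigma_1| - \tfrac{1}{2}(X-M_2)^t\Sigma_2^{-1}(X-M_2) - \tfrac{1}{2}\ln|\Sigma_2|
\end{aligned}
$$

And $\nabla h$ will be given by

$$\nabla h = \Sigma_1^{-1}(X-M_1) - \Sigma_2^{-1}(X-M_2) = (\Sigma_1^{-1}-\Sigma_2^{-1})X + (\Sigma_2^{-1}M_2 - \Sigma_1^{-1}M_1)$$

Then the vector normal to the decision boundary at $X_0$ is given by

$$N = \nabla h(X)\big|_{X=X_0} = (\Sigma_1^{-1}-\Sigma_2^{-1})X_0 + (\Sigma_2^{-1}M_2 - \Sigma_1^{-1}M_1) \qquad \text{(A.1)}$$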
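
Equation (A.1) translates directly into a few lines of code. The sketch below assumes two Gaussian classes with known mean vectors and covariance matrices. The function name and the numerical values are illustrative.

```python
import numpy as np


def normal_vector(x0, m1, cov1, m2, cov2):
    """Normal to the Gaussian ML decision boundary at x0, following (A.1)."""
    inv1 = np.linalg.inv(cov1)
    inv2 = np.linalg.inv(cov2)
    return (inv1 - inv2) @ x0 + (inv2 @ m2 - inv1 @ m1)


# Illustrative two-class problem with equal covariance matrices.
m1 = np.array([1.0, -1.0])
m2 = np.array([-1.0, 1.0])
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
n = normal_vector(np.array([1.0, 1.0]), m1, cov, m2, cov)
print(n)
```

With equal covariance matrices the first term vanishes and the normal vector is the same at every point of the boundary, as expected for a linear decision boundary.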


The following theorem gives the point where the straight line connecting
two points $P_1$ and $P_2$ meets the decision boundary. Theorems A.1 and A.2
can be employed to implement the proposed procedure to calculate a decision
boundary feature matrix for parametric classifiers.

Theorem A.2 If $P_1$ and $P_2$ are on different sides of a decision boundary
$h(X) = t$, assuming that a Gaussian ML classifier is used, the point $X_0$
where the line connecting $P_1$ and $P_2$ passes through the decision
boundary is given by

$$X_0 = uV + V_0 \qquad \text{(A.2)}$$

where

$$
\begin{aligned}
V_0 &= P_1, \qquad V = P_2 - P_1, \\
u &= \frac{t - c'}{b} \quad \text{if } a = 0, \\
u &= \frac{-b \pm \sqrt{b^2 - 4a(c' - t)}}{2a} \ \text{ and } \ 0 \le u \le 1 \quad \text{if } a \ne 0, \\
a &= \tfrac{1}{2}V^t(\Sigma_1^{-1}-\Sigma_2^{-1})V, \\
b &= V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V, \\
c' &= \tfrac{1}{2}V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V_0 - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V_0 + c, \\
c &= \tfrac{1}{2}(M_1^t\Sigma_1^{-1}M_1 - M_2^t\Sigma_2^{-1}M_2) + \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|}
\end{aligned}
$$


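Proof: $h(X)$ is given by

$$
\begin{aligned}
h(X) &= -\ln\frac{P(X|\omega_1)}{P(X|\omega_2)} = -\ln P(X|\omega_1) + \ln P(X|\omega_2) \\
&= \tfrac{1}{2}(X-M_1)^t\Sigma_1^{-1}(X-M_1) + \tfrac{1}{2}\ln|\Sigma_1| - \tfrac{1}{2}(X-M_2)^t\Sigma_2^{-1}(X-M_2) - \tfrac{1}{2}\ln|\Sigma_2| \\
&= \tfrac{1}{2}X^t\Sigma_1^{-1}X - M_1^t\Sigma_1^{-1}X + \tfrac{1}{2}M_1^t\Sigma_1^{-1}M_1 + \tfrac{1}{2}\ln|\Sigma_1| \\
&\quad - \tfrac{1}{2}X^t\Sigma_2^{-1}X + M_2^t\Sigma_2^{-1}X - \tfrac{1}{2}M_2^t\Sigma_2^{-1}M_2 - \tfrac{1}{2}\ln|\Sigma_2| \\
&= \tfrac{1}{2}X^t(\Sigma_1^{-1}-\Sigma_2^{-1})X - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})X \\
&\quad + \tfrac{1}{2}(M_1^t\Sigma_1^{-1}M_1 - M_2^t\Sigma_2^{-1}M_2) + \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} \\
&= \tfrac{1}{2}X^t(\Sigma_1^{-1}-\Sigma_2^{-1})X - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})X + c
\end{aligned}
$$

where $c = \tfrac{1}{2}(M_1^t\Sigma_1^{-1}M_1 - M_2^t\Sigma_2^{-1}M_2) + \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|}$.

Let $X_0 = uV + V_0$ where $u$ is a scalar. Then $h(X_0) = h(uV + V_0)$ is given by

$$
\begin{aligned}
h(uV + V_0) &= \tfrac{1}{2}(uV+V_0)^t(\Sigma_1^{-1}-\Sigma_2^{-1})(uV+V_0) - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})(uV+V_0) + c \\
&= \{\tfrac{1}{2}V^t(\Sigma_1^{-1}-\Sigma_2^{-1})V\}u^2 + \{V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V\}u + \tfrac{1}{2}V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V_0 \\
&\quad - \{(M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V\}u - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V_0 + c \\
&= \{\tfrac{1}{2}V^t(\Sigma_1^{-1}-\Sigma_2^{-1})V\}u^2 + \{V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V\}u + c'
\end{aligned}
$$

where $c' = \tfrac{1}{2}V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V_0 - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V_0 + c$.

Let $a = \tfrac{1}{2}V^t(\Sigma_1^{-1}-\Sigma_2^{-1})V$ and
$b = V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V$. Then

$$h(uV + V_0) = au^2 + bu + c' = t$$

Let $f(u) = h(uV + V_0) - t = au^2 + bu + c' - t$.
Then the solutions to $f(u) = 0$ are given by

$$\text{If } a = 0, \quad u = \frac{t - c'}{b}$$

$$\text{If } a \ne 0, \quad u = \frac{-b \pm \sqrt{b^2 - 4a(c' - t)}}{2a}$$

Therefore the point which is on the straight line $uV + V_0$ and on the
decision boundary $h(X) = t$ can be given by

$$X_0 = uV + V_0$$

where

$$
\begin{aligned}
u &= \frac{t - c'}{b} \quad \text{if } a = 0, \\
u &= \frac{-b \pm \sqrt{b^2 - 4a(c' - t)}}{2a} \ \text{ and } \ 0 \le u \le 1 \quad \text{if } a \ne 0, \\
a &= \tfrac{1}{2}V^t(\Sigma_1^{-1}-\Sigma_2^{-1})V, \\
b &= V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V, \\
c' &= \tfrac{1}{2}V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V_0 - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V_0 + c, \\
c &= \tfrac{1}{2}(M_1^t\Sigma_1^{-1}M_1 - M_2^t\Sigma_2^{-1}M_2) + \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|}
\end{aligned}
$$
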

Equation (A.2) can be used to calculate the point on the decision boundary from 
two samples classified differently and equation (A.1) can be used to calculate a 
normal vector to the decision boundary. 
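
A minimal sketch of equation (A.2) is given below. It assumes two Gaussian classes with known statistics and two points that lie on opposite sides of the boundary, so that a root with 0 <= u <= 1 exists. The function name, the default threshold t = 0, and the numerical values are illustrative.

```python
import numpy as np


def boundary_point(p1, p2, m1, cov1, m2, cov2, t=0.0):
    """Return (u, X0), where X0 = u*V + V0 lies on the boundary h(X) = t."""
    p1, p2, m1, m2 = (np.asarray(v, dtype=float) for v in (p1, p2, m1, m2))
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    v0, v = p1, p2 - p1
    diff_inv = inv1 - inv2                    # Sigma1^-1 - Sigma2^-1
    lin = m1 @ inv1 - m2 @ inv2               # M1^t Sigma1^-1 - M2^t Sigma2^-1
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    c = 0.5 * (m1 @ inv1 @ m1 - m2 @ inv2 @ m2) + 0.5 * (logdet1 - logdet2)
    a = 0.5 * v @ diff_inv @ v
    b = v0 @ diff_inv @ v - lin @ v
    c_prime = 0.5 * v0 @ diff_inv @ v0 - lin @ v0 + c
    if np.isclose(a, 0.0):
        # Linear case; b is nonzero when the points are on different sides.
        roots = [(t - c_prime) / b]
    else:
        disc = np.sqrt(b * b - 4.0 * a * (c_prime - t))
        roots = [(-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)]
    # Keep the root that falls between the two points.
    u = next(r for r in roots if 0.0 <= r <= 1.0)
    return u, u * v + v0


cov = np.array([[1.0, 0.5], [0.5, 1.0]])
u, x0 = boundary_point([1.0, 0.0], [1.0, 2.0], [1.0, -1.0], cov, [-1.0, 1.0], cov)
print(u, x0)
```

In a decision boundary feature extraction procedure, a function of this kind would be called for pairs of samples classified differently, and the normal vector of equation (A.1) would then be evaluated at each returned point.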


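Example A.1 Assuming that a Gaussian ML classifier is used, the mean vectors
and covariance matrices of two classes are given as follows:

$$M_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \qquad M_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$$

$$P(\omega_1) = P(\omega_2) = 0.5$$

The inverses and determinants of $\Sigma_1$ and $\Sigma_2$ are given by

$$\Sigma_1^{-1} = \frac{4}{3}\begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}, \quad \det(\Sigma_1) = 0.75$$

$$\Sigma_2^{-1} = \frac{4}{3}\begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}, \quad \det(\Sigma_2) = 0.75$$

Let $P_1 = (1,0)$ and $P_2 = (1,2)$ be points on different sides of the
decision boundary. Then the equation of the straight line connecting the two
points is given as follows:

$$uV + V_0 \quad \text{where } V = P_2 - P_1 = \begin{bmatrix} 0 \\ 2 \end{bmatrix}, \quad V_0 = P_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

Since $\Sigma_1 = \Sigma_2$, $a = 0$, and the point where the decision
boundary and the straight line meet is given by

$$X = uV + V_0$$

where, with $t = 0$,

$$
\begin{aligned}
u &= \frac{t - c'}{b} = -\frac{c'}{b}, \\
b &= V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V, \\
c' &= \tfrac{1}{2}V_0^t(\Sigma_1^{-1}-\Sigma_2^{-1})V_0 - (M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V_0 + c, \\
c &= \tfrac{1}{2}(M_1^t\Sigma_1^{-1}M_1 - M_2^t\Sigma_2^{-1}M_2) + \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|}
\end{aligned}
$$

Therefore,

$$
\begin{aligned}
c &= \tfrac{1}{2}(M_1^t\Sigma_1^{-1}M_1 - M_2^t\Sigma_2^{-1}M_2) = \tfrac{1}{2}(4 - 4) = 0 \\
c' &= -(M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V_0 \\
&= -\frac{4}{3}\begin{bmatrix} 1 & -1 \end{bmatrix}\begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} + \frac{4}{3}\begin{bmatrix} -1 & 1 \end{bmatrix}\begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} \\
&= \frac{4}{3}(-1.5 - 1.5) = -4 \\
b &= -(M_1^t\Sigma_1^{-1} - M_2^t\Sigma_2^{-1})V \\
&= -\frac{4}{3}\begin{bmatrix} 1 & -1 \end{bmatrix}\begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}\begin{bmatrix} 0 \\ 2 \end{bmatrix} + \frac{4}{3}\begin{bmatrix} -1 & 1 \end{bmatrix}\begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}\begin{bmatrix} 0 \\ 2 \end{bmatrix} \\
&= \frac{4}{3}(3 + 3) = 8 \\
u &= -\frac{c'}{b} = 0.5 \\
X &= 0.5V + V_0 = 0.5\begin{bmatrix} 0 \\ 2 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}
\end{aligned}
$$

Figure A.1 Solution of h(uV + V_0) = t where u needs to be found.


Appendix B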

This appendix contains source code listings for the algorithms involved. Due
to its length, it has not been included in all copies of this report. It is
available upon request from the authors.